{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "87258e76-c1b6-4153-a137-7be563d1b5e4",
   "metadata": {},
   "source": [
    "# Chapter 14.3 - The SD and the Normal Curve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4848b1e0-2e13-4068-b653-618074856d0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths = Table.read_table(\"top_spotify_songs_usa.csv\").select(\"duration_ms\")\n",
    "song_lengths.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0223a98-7216-4ef0-9b84-bc5554f9ef84",
   "metadata": {},
   "outputs": [],
   "source": [
    "true_mean = np.mean(song_lengths.column(\"duration_ms\"))\n",
    "true_std = np.std(song_lengths.column(\"duration_ms\"))\n",
    "print(\"True Mean = {:.0f}, True STD = {:.0f}\".format(true_mean, true_std))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99a11b8d-8907-46f6-a513-6e3f4a82bae4",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_samples = 50      # size of each random sample\n",
    "repetitions = 10000   # number of samples to draw\n",
    "\n",
    "average_song_lengths = make_array()\n",
    "\n",
    "# Draw repeated random samples (with replacement by default) and record each sample mean\n",
    "for _ in np.arange(repetitions):\n",
    "    lengths = song_lengths.sample(num_samples)\n",
    "    new_average_length = int(np.round(np.mean(lengths.column('duration_ms'))))\n",
    "    average_song_lengths = np.append(average_song_lengths, new_average_length)\n",
    "\n",
    "results = Table().with_column('Average Song Length', average_song_lengths)\n",
    "results.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "316f836e-423e-4ca1-b96c-f41190efc296",
   "metadata": {},
   "outputs": [],
   "source": [
    "results.hist(\"Average Song Length\", bins = np.arange(170000, 235001, 500))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db7f0058-b435-4751-9f75-ac2e9c9e6a84",
   "metadata": {},
   "source": [
    "**Conceptual Questions**\n",
    "- How do you expect the mean of the `results` table to compare with the mean of the `song_lengths` table?\n",
    "- How do you expect the SD of the `results` table to compare with the SD of the `song_lengths` table?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3883d3f-bffc-4ff8-a8cb-a4a39d7f6d44",
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_of_samples = np.mean(results.column(\"Average Song Length\"))\n",
    "std_of_samples = np.std(results.column(\"Average Song Length\"))\n",
    "print(\"Mean of Samples = {:.0f}, STD of Samples = {:.0f}\".format(mean_of_samples, std_of_samples))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8bcdd06-0125-4b9e-a8fd-957dc6b2b704",
   "metadata": {},
   "source": [
    "How will the histogram above change if the size of each sample (`num_samples`) is increased to 500? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "434cfaa6-0972-4f2f-8e7d-ab7feff2ef93",
   "metadata": {},
   "source": [
    "# Chapter 14.4 - The Central Limit Theorem"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c92d490-770a-4705-8e50-5253ea14c2aa",
   "metadata": {},
   "source": [
    "**Central Limit Theorem** - the probability distribution of the sum or average of a large random sample drawn with replacement \n",
    "will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*."
   ]
  },
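  {
   "cell_type": "markdown",
   "id": "34eb6edc-41d2-4cd1-aa20-69715c46e198",
   "metadata": {},
   "source": [
    "The standard normal **cdf** (cumulative distribution function) gives the proportion of a standard normal distribution that lies at or below a given value in standard units. It is available as `stats.norm.cdf` in SciPy."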
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d6578b2-8bb7-492a-8179-00b3592a804c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy import stats"
   ]
  },
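  {
   "cell_type": "markdown",
   "id": "5e2a9c14-7d3b-4b8f-a6c1-0f9d3e7b2a45",
   "metadata": {},
   "source": [
    "The cells below subtract two cdf values to get the area under the curve between them. As a brief sketch of why that works (the symbols $Z$ and $\\Phi$ are notation assumed here for a standard normal variable and its cdf; they are not used elsewhere in this notebook), the proportion within $z$ SDs of the mean is:\n",
    "\n",
    "$$P(-z \\le Z \\le z) = \\Phi(z) - \\Phi(-z)$$\n",
    "\n",
    "In code, $\\Phi(z)$ corresponds to `stats.norm.cdf(z)`."
   ]
  },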
  {
   "cell_type": "markdown",
   "id": "62f24482-6f54-4e21-8eb8-23775641a582",
   "metadata": {},
   "source": [
    "For a normal distribution, the percentage of values within 1 standard deviation of the mean is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80c57e18-9faa-42a5-aede-67886157df0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reminder: Chebyshev's bound guarantees at least 0% for any distribution\n",
    "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(1) - stats.norm.cdf(-1))))"
   ]
  },
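  {
   "cell_type": "markdown",
   "id": "c7e41a9d-3b52-4f0e-9a6d-2f8b51d0e3a7",
   "metadata": {},
   "source": [
    "The reminder comments in these code cells can be checked against Chebyshev's bound itself. As a short worked version (the symbol $z$ for the number of SDs is introduced here for illustration), for any distribution the proportion of values within $z$ SDs of the mean is at least:\n",
    "\n",
    "$$1 - \\frac{1}{z^2}$$\n",
    "\n",
    "So the guaranteed minimum is $1 - 1/1^2 = 0$ for $z = 1$, $1 - 1/2^2 = 0.75$ for $z = 2$, and $1 - 1/3^2 \\approx 0.8889$ for $z = 3$. The percentages computed from `stats.norm.cdf` for the normal curve sit well above these minimums."
   ]
  },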
  {
   "cell_type": "markdown",
   "id": "8d4b55f9-7a0d-4a90-859a-38ddc97f300c",
   "metadata": {},
   "source": [
    "**Active Learning**: Use python to \n",
    "- Print the lowest value in the results table above that is within 1 STD of the mean\n",
    "- Print the highest value in the results table above that is within 1 STD of the mean.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14cb09b9-e3b0-442d-89ac-6ea4a036e1ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Place answer here."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e011da5-05ab-4b1b-af14-6d3da78addf2",
   "metadata": {},
   "source": [
    "For a normal distribution, the percentage of values within 2 standard deviations of the mean is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b55dd1d-a2a9-4fa2-acaa-7ee04cf7462e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reminder: Chebyshev's bound guarantees at least 75% for any distribution\n",
    "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(2) - stats.norm.cdf(-2))))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a25f230-92e6-4221-8404-69ce1e30f06c",
   "metadata": {},
   "source": [
    "For a normal distribution, the percentage of values within 3 standard deviations of the mean is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db58c89d-02e9-4cce-991a-6cc0926426c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reminder: Chebyshev's bound guarantees at least 88.89% for any distribution\n",
    "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(3) - stats.norm.cdf(-3))))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6661f9b7-d23d-4ce6-bfd2-c8690a9782a0",
   "metadata": {},
   "source": [
    "In general, for bell-shaped distributions, the SD is the distance between the mean and the \n",
    "points of inflection on either side."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}