{ "cells": [ { "cell_type": "markdown", "id": "87258e76-c1b6-4153-a137-7be563d1b5e4", "metadata": {}, "source": [ "# Chapter 14.3 - The SD and the Normal Curve" ] }, { "cell_type": "code", "execution_count": null, "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "4848b1e0-2e13-4068-b653-618074856d0d", "metadata": {}, "outputs": [], "source": [ "song_lengths = Table.read_table(\"top_spotify_songs_usa.csv\").select(\"duration_ms\")\n", "song_lengths.show(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "d0223a98-7216-4ef0-9b84-bc5554f9ef84", "metadata": {}, "outputs": [], "source": [ "true_mean = np.mean(song_lengths.column(\"duration_ms\"))\n", "true_std = np.std(song_lengths.column(\"duration_ms\"))\n", "print(\"True Mean = {:.0f}, True SD = {:.0f}\".format(true_mean, true_std))" ] }, { "cell_type": "code", "execution_count": null, "id": "99a11b8d-8907-46f6-a513-6e3f4a82bae4", "metadata": {}, "outputs": [], "source": [ "num_samples = 50  # size of each random sample\n", "repetitions = 10000  # number of samples to draw\n", "\n", "average_song_lengths = make_array()\n", "\n", "for _ in np.arange(repetitions):\n", "    # Table.sample draws rows with replacement by default\n", "    lengths = song_lengths.sample(num_samples)\n", "    new_average_length = int(np.round(np.mean(lengths.column('duration_ms'))))\n", "    average_song_lengths = np.append(average_song_lengths, new_average_length)\n", "\n", "results = Table().with_column('Average Song Length', average_song_lengths)\n", "results.show(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "316f836e-423e-4ca1-b96c-f41190efc296", "metadata": {}, "outputs": [], "source": [ "results.hist(\"Average Song Length\", bins = np.arange(170000, 235001, 500))" ] },
{ "cell_type": "markdown", "id": "db7f0058-b435-4751-9f75-ac2e9c9e6a84", "metadata": {}, "source": [ "**Conceptual Questions**\n", "- How do you expect the mean of the `results` table to compare with the mean of the `song_lengths` table?\n", "- How do you expect the SD of the `results` table to compare with the SD of the `song_lengths` table?" ] }, { "cell_type": "code", "execution_count": null, "id": "c3883d3f-bffc-4ff8-a8cb-a4a39d7f6d44", "metadata": {}, "outputs": [], "source": [ "mean_of_samples = np.mean(results.column(\"Average Song Length\"))\n", "std_of_samples = np.std(results.column(\"Average Song Length\"))\n", "print(\"Mean of Sample Means = {:.0f}, SD of Sample Means = {:.0f}\".format(mean_of_samples, std_of_samples))" ] }, { "cell_type": "raw", "id": "a8bcdd06-0125-4b9e-a8fd-957dc6b2b704", "metadata": {}, "source": [ "How will the histogram above change if the sample size (num_samples) is increased from 50 to 500? Why?" ] }, { "cell_type": "markdown", "id": "434cfaa6-0972-4f2f-8e7d-ab7feff2ef93", "metadata": {}, "source": [ "# Chapter 14.4 - The Central Limit Theorem" ] }, { "cell_type": "markdown", "id": "1c92d490-770a-4705-8e50-5253ea14c2aa", "metadata": {}, "source": [ "**Central Limit Theorem** - the probability distribution of the sum or average of a large random sample drawn with replacement\n", "will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*."
] }, { "cell_type": "markdown", "id": "34eb6edc-41d2-4cd1-aa20-69715c46e198", "metadata": {}, "source": [ "Standard Normal **cdf** (Cumulative Distribution Function)" ] }, { "cell_type": "code", "execution_count": null, "id": "2d6578b2-8bb7-492a-8179-00b3592a804c", "metadata": {}, "outputs": [], "source": [ "from scipy import stats" ] }, { "cell_type": "markdown", "id": "62f24482-6f54-4e21-8eb8-23775641a582", "metadata": {}, "source": [ "For normal distributions, the percentage of data within 1 standard deviation of the mean is:" ] }, { "cell_type": "code", "execution_count": null, "id": "80c57e18-9faa-42a5-aede-67886157df0e", "metadata": {}, "outputs": [], "source": [ "# Reminder: Chebyshev's bound states that it will be at least 0%\n", "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(1) - stats.norm.cdf(-1))))" ] }, { "cell_type": "markdown", "id": "8d4b55f9-7a0d-4a90-859a-38ddc97f300c", "metadata": {}, "source": [ "**Active Learning**: Use Python to:\n", "- Print the lowest value in the `results` table above that is within 1 SD of the mean.\n", "- Print the highest value in the `results` table above that is within 1 SD of the mean.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "14cb09b9-e3b0-442d-89ac-6ea4a036e1ec", "metadata": {}, "outputs": [], "source": [ "# Place answer here."
] }, { "cell_type": "markdown", "id": "3e011da5-05ab-4b1b-af14-6d3da78addf2", "metadata": {}, "source": [ "For normal distributions, the percentage of data within 2 standard deviations of the mean is:" ] }, { "cell_type": "code", "execution_count": null, "id": "4b55dd1d-a2a9-4fa2-acaa-7ee04cf7462e", "metadata": {}, "outputs": [], "source": [ "# Reminder: Chebyshev's bound states that it will be at least 75%\n", "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(2) - stats.norm.cdf(-2))))" ] }, { "cell_type": "markdown", "id": "8a25f230-92e6-4221-8404-69ce1e30f06c", "metadata": {}, "source": [ "For normal distributions, the percentage of data within 3 standard deviations of the mean is:" ] }, { "cell_type": "code", "execution_count": null, "id": "db58c89d-02e9-4cce-991a-6cc0926426c0", "metadata": {}, "outputs": [], "source": [ "# Reminder: Chebyshev's bound states that it will be at least 88.89%\n", "print(\"{:.2f}%\".format(100 * (stats.norm.cdf(3) - stats.norm.cdf(-3))))" ] }, { "cell_type": "markdown", "id": "6661f9b7-d23d-4ce6-bfd2-c8690a9782a0", "metadata": {}, "source": [ "In general, for bell-shaped distributions, the SD is the distance between the mean and the\n", "points of inflection on either side." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }