{ "cells": [ { "cell_type": "markdown", "id": "638b94c5-19fd-4762-8103-52b70eff5eb5", "metadata": {}, "source": [ "# Chapter 14 - Why the Mean Matters" ] }, { "cell_type": "code", "execution_count": null, "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "7c187b5a-bf1c-4da2-b1c0-5e5b4a80ade8", "metadata": {}, "source": [ "## 14.1 - Properties of the Mean" ] }, { "cell_type": "markdown", "id": "e74d593b-767e-4496-b40a-4b2e2e9f74fe", "metadata": {}, "source": [ "**Average** or **Mean** of a collection of numbers: the sum of all the elements of the collection, \n", "divided by the number of elements in the collection." ] }, { "cell_type": "code", "execution_count": null, "id": "25e8672e-a258-44e9-85a1-7689502881c8", "metadata": {}, "outputs": [], "source": [ "heights = make_array(66, 73, 65, 67)\n", "# np.average and np.mean agree for an unweighted collection\n", "print(np.average(heights))\n", "print(np.mean(heights))" ] }, { "cell_type": "markdown", "id": "095d849a-d200-405a-b8a4-54c796d8eb74", "metadata": {}, "source": [ "Proportions are means: the mean of a collection of 0s and 1s (or of `False`/`True` values, which NumPy treats as 0 and 1) is the proportion of 1s in the collection." ] }, { "cell_type": "code", "execution_count": null, "id": "446cdbd1-71b1-49da-991f-30883a6823f2", "metadata": {}, "outputs": [], "source": [ "# 1 (or True) marks a height of at least 70 inches (5 ft 10 in)\n", "at_least_5_10 = make_array(0, 1, 0, 0)\n", "print(np.mean(at_least_5_10))\n", "at_least_5_10 = make_array(False, True, False, False)\n", "print(np.mean(at_least_5_10))" ] }, { "cell_type": "markdown", "id": "b8925c92-5470-45d2-9d13-a50bc8847cb1", "metadata": {}, "source": [ "The mean of a collection depends only on the distinct values and their proportions, not on the number of elements \n", "in the collection. In other words, the mean of a collection depends only on the distribution of values in the collection.\n", "Therefore, **if two collections have the same distribution, then they have the same mean**."
] }, { "cell_type": "code", "execution_count": null, "id": "1f719e73-6007-46c1-89f0-d675151ff4cd", "metadata": {}, "outputs": [], "source": [ "# heights_2 repeats each value of heights twice, so both collections have the same distribution\n", "heights_2 = make_array(65, 66, 67, 73, 73, 67, 66, 65)\n", "print(np.mean(heights))\n", "print(np.mean(heights_2))" ] }, { "cell_type": "markdown", "id": "29896fa8-66a2-42ed-b830-d5674bf9cea6", "metadata": {}, "source": [ "The mean is the center of gravity or balance point of the histogram." ] }, { "cell_type": "code", "execution_count": null, "id": "eac9da3f-7661-45b1-8ef3-34b1d8030a6c", "metadata": {}, "outputs": [], "source": [ "heights_table = Table().with_column(\"Heights\", heights)\n", "heights_table.hist(\"Heights\")" ] }, { "cell_type": "code", "execution_count": null, "id": "09ae8efb-8c17-46fb-a4e0-c91c814f138f", "metadata": {}, "outputs": [], "source": [ "# Median (50th percentile); compare it with the mean computed above\n", "percentile(50, heights)" ] }, { "cell_type": "markdown", "id": "a48cc4d2-bd2d-4c2d-a110-5bc0c051af8e", "metadata": {}, "source": [ "Notice that more than half of the students can be below average in height: in this example, 3 of the 4 heights (75%) are below the mean of 67.75."
] }, { "cell_type": "markdown", "id": "b8cfa2f4-a4da-48d0-94b0-14ac13f2ded3", "metadata": {}, "source": [ "## 14.2 - Variability" ] }, { "cell_type": "code", "execution_count": null, "id": "9f603103-5854-4c40-a868-fff94f461d9f", "metadata": {}, "outputs": [], "source": [ "deviations = heights - np.mean(heights)\n", "heights_table = heights_table.with_column(\"Deviation from Mean\", deviations)\n", "heights_table" ] }, { "cell_type": "code", "execution_count": null, "id": "1b5ae1a9-53e2-41d5-946a-8d18b1adb286", "metadata": {}, "outputs": [], "source": [ "# Note: the deviations from the mean sum to zero, so their mean is zero as well\n", "np.mean(heights_table.column(\"Deviation from Mean\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "969b6de9-ed1b-4cc0-a381-0fc7dfe9ee67", "metadata": {}, "outputs": [], "source": [ "squared_deviations = deviations ** 2\n", "heights_table = heights_table.with_column(\"Squared Deviation from Mean\", squared_deviations)\n", "heights_table" ] }, { "cell_type": "code", "execution_count": null, "id": "5592e095-7672-49c3-9b5c-f761be62f185", "metadata": {}, "outputs": [], "source": [ "# Variance: the mean of the squared deviations\n", "variance = np.mean(heights_table.column(\"Squared Deviation from Mean\"))\n", "variance" ] }, { "cell_type": "code", "execution_count": null, "id": "cc8255f5-c25b-451b-ab48-17241da2b3a7", "metadata": {}, "outputs": [], "source": [ "# SD: the square root of the variance\n", "standard_deviation = variance ** 0.5\n", "standard_deviation" ] }, { "cell_type": "markdown", "id": "c643b1ae-f094-452d-877e-d951a5450844", "metadata": {}, "source": [ "**Standard Deviation (SD)** of a list: the root mean square of the deviations from average, that is, the square root of the variance." ] },
{ "cell_type": "code", "execution_count": null, "id": "f24377c6-8161-47f7-990c-f6513444f285", "metadata": {}, "outputs": [], "source": [ "# np.std computes the same SD: by default it divides by n, not n - 1\n", "np.std(heights)" ] }, { "cell_type": "markdown", "id": "785d0cbd-639e-4eca-8962-bc3014f2045e", "metadata": {}, "source": [ "**Chebychev’s Bounds**: For all lists, and all positive numbers *z*, the proportion of entries that are in the range \n", "“average ± *z* SDs” is at least $1 - 1/z^2$." ] }, { "cell_type": "markdown", "id": "75d2e057-e3b5-4382-bd3a-cb2a52481ef3", "metadata": {}, "source": [ "Let's examine the variability of song lengths in our Spotify CSV file (`top_spotify_songs_usa.csv`), using the `duration_ms` column." ] }, { "cell_type": "code", "execution_count": null, "id": "4848b1e0-2e13-4068-b653-618074856d0d", "metadata": {}, "outputs": [], "source": [ "song_lengths = Table.read_table(\"top_spotify_songs_usa.csv\").select(\"duration_ms\")\n", "song_lengths.show(5)" ] }, { "cell_type": "markdown", "id": "196f1ee2-e65d-478f-8166-d83e61836520", "metadata": {}, "source": [ "To convert a value to **standard units**, compute (value - average) / SD. Standard units measure how many SDs a value lies above (positive) or below (negative) the average." ] }, { "cell_type": "code", "execution_count": null, "id": "36f387a4-64fd-439d-92f0-4baa3946cdf1", "metadata": {}, "outputs": [], "source": [ "def standard_units(numbers_array):\n", "    \"Convert any array of numbers to standard units.\"\n", "    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)" ] }, { "cell_type": "code", "execution_count": null, "id": "f2cf6c5e-e4fc-4716-a905-1a5f0364fbda", "metadata": {}, "outputs": [], "source": [ "song_lengths = song_lengths.with_column(\"Length (Standard Units)\", standard_units(song_lengths.column(\"duration_ms\")))\n", "song_lengths.show(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "19dbc38a-17cf-437f-844a-0fa1bfc464b5", "metadata": {}, "outputs": [], "source": [ "# Longest songs first\n", "song_lengths.sort(\"duration_ms\", descending=True)" ] },
{ "cell_type": "code", "execution_count": null, "id": "da8ecf09-1ce6-4bb1-a23b-62c76a8c6f17", "metadata": {}, "outputs": [], "source": [ "# Shortest songs first\n", "song_lengths.sort(\"duration_ms\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e1e5cfef-c3ce-4f80-80f0-8ec6aedc2b12", "metadata": {}, "outputs": [], "source": [ "# are.between(-3, 3) keeps values from -3 (inclusive) up to 3 (exclusive)\n", "within_3_sd = song_lengths.where('Length (Standard Units)', are.between(-3, 3))\n", "print(\"Songs within 3 standard deviations of the mean: {:.2f}%\".format(100 * within_3_sd.num_rows/song_lengths.num_rows))" ] }, { "cell_type": "code", "execution_count": null, "id": "cbc42685-2c35-48f4-9fa1-c2a5047439bf", "metadata": {}, "outputs": [], "source": [ "print(\"Chebychev's bounds guarantee this number is at least {:.2f}%\".format((1 - 1/3**2) * 100))" ] }, { "cell_type": "code", "execution_count": null, "id": "2298ba01-c9c1-443b-9580-f7763ba1461c", "metadata": {}, "outputs": [], "source": [ "song_lengths.hist('Length (Standard Units)', bins=np.arange(-4, 7, 0.5))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }