{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "638b94c5-19fd-4762-8103-52b70eff5eb5",
   "metadata": {},
   "source": [
    "# Chapter 14 - Why the Mean Matters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c187b5a-bf1c-4da2-b1c0-5e5b4a80ade8",
   "metadata": {},
   "source": [
    "## 14.1 - Properties of the Mean"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e74d593b-767e-4496-b40a-4b2e2e9f74fe",
   "metadata": {},
   "source": [
    "**Average** or **Mean** of a collection of numbers: the sum of all the elements of the collection, \n",
    "divided by the number of elements in the collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25e8672e-a258-44e9-85a1-7689502881c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "heights = make_array(66, 73, 65, 67)\n",
    "print(np.average(heights))\n",
    "print(np.mean(heights))"
   ]
  },
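  {
   "cell_type": "markdown",
   "id": "095d849a-d200-405a-b8a4-54c796d8eb74",
   "metadata": {},
   "source": [
    "Proportions are means: if a collection consists only of 0s and 1s (or `False` and `True`), its mean is the proportion of 1s."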
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "446cdbd1-71b1-49da-991f-30883a6823f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "at_least_5_10 = make_array(0, 1, 0, 0)\n",
    "print(np.mean(at_least_5_10))\n",
    "at_least_5_10 = make_array(False, True, False, False)\n",
    "print(np.mean(at_least_5_10))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8925c92-5470-45d2-9d13-a50bc8847cb1",
   "metadata": {},
   "source": [
    "The mean of a collection depends only on the distinct values and their proportions, not on the number of elements \n",
    "in the collection. In other words, the mean of a collection depends only on the distribution of values in the collection.\n",
    "Therefore, **if two collections have the same distribution, then they have the same mean**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f719e73-6007-46c1-89f0-d675151ff4cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# heights_2 contains each value of heights twice, so it has the same distribution\n",
    "heights_2 = make_array(65, 66, 67, 73, 73, 67, 66, 65)\n",
    "np.mean(heights_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29896fa8-66a2-42ed-b830-d5674bf9cea6",
   "metadata": {},
   "source": [
    "The mean is the center of gravity or balance point of the histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eac9da3f-7661-45b1-8ef3-34b1d8030a6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "heights_table = Table().with_column(\"Heights\", heights)\n",
    "heights_table.hist(\"Heights\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09ae8efb-8c17-46fb-a4e0-c91c814f138f",
   "metadata": {},
   "outputs": [],
   "source": [
    "percentile(50, heights)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a48cc4d2-bd2d-4c2d-a110-5bc0c051af8e",
   "metadata": {},
   "source": [
    "Notice that the mean (67.75) is larger than the 50th percentile (66): more than half of the students (75% in this example) can be below average in height."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8cfa2f4-a4da-48d0-94b0-14ac13f2ded3",
   "metadata": {},
   "source": [
    "## 14.2 - Variability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f603103-5854-4c40-a868-fff94f461d9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "deviations = heights - np.mean(heights)\n",
    "heights_table = heights_table.with_column(\"Deviation from Mean\", deviations)\n",
    "heights_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b5ae1a9-53e2-41d5-946a-8d18b1adb286",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: the deviations from average sum to zero, so their mean is zero too\n",
    "np.mean(heights_table.column(\"Deviation from Mean\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "969b6de9-ed1b-4cc0-a381-0fc7dfe9ee67",
   "metadata": {},
   "outputs": [],
   "source": [
    "squared_deviations = deviations ** 2\n",
    "heights_table = heights_table.with_column(\"Squared Deviation from Average\", squared_deviations)\n",
    "heights_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5592e095-7672-49c3-9b5c-f761be62f185",
   "metadata": {},
   "outputs": [],
   "source": [
    "variance = np.mean(heights_table.column(\"Squared Deviation from Average\"))\n",
    "variance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc8255f5-c25b-451b-ab48-17241da2b3a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "standard_deviation = variance ** 0.5\n",
    "standard_deviation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c643b1ae-f094-452d-877e-d951a5450844",
   "metadata": {},
   "source": [
    "**Standard Deviation (SD)** of a list: the root mean square of deviations from average. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f24377c6-8161-47f7-990c-f6513444f285",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.std(heights)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "785d0cbd-639e-4eca-8962-bc3014f2045e",
   "metadata": {},
   "source": [
    "**Chebychev’s Bounds**: For all lists, and all positive numbers *z*, the proportion of entries that are in the range \n",
    "“average ± *z* SDs” is at least *1 - 1/z²*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75d2e057-e3b5-4382-bd3a-cb2a52481ef3",
   "metadata": {},
   "source": [
    "Let's examine the variability of song lengths in our Spotify CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4848b1e0-2e13-4068-b653-618074856d0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths = Table.read_table(\"top_spotify_songs_usa.csv\").select(\"duration_ms\")\n",
    "song_lengths.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "196f1ee2-e65d-478f-8166-d83e61836520",
   "metadata": {},
   "source": [
    "To convert a value to **standard units**, compute (value - average) / SD. The result measures how many SDs the value lies above (positive) or below (negative) the average."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36f387a4-64fd-439d-92f0-4baa3946cdf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def standard_units(numbers_array):\n",
    "    \"Convert any array of numbers to standard units.\"\n",
    "    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2cf6c5e-e4fc-4716-a905-1a5f0364fbda",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths = song_lengths.with_column(\"Length (Standard Units)\", standard_units(song_lengths.column(\"duration_ms\")))\n",
    "song_lengths.show(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19dbc38a-17cf-437f-844a-0fa1bfc464b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths.sort(\"duration_ms\", descending=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da8ecf09-1ce6-4bb1-a23b-62c76a8c6f17",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths.sort(\"duration_ms\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1e5cfef-c3ce-4f80-80f0-8ec6aedc2b12",
   "metadata": {},
   "outputs": [],
   "source": [
    "within_3_sd = song_lengths.where('Length (Standard Units)', are.between(-3, 3))\n",
    "print(\"Songs within 3 standard deviations of mean: {:.2f}%\".format(100 * within_3_sd.num_rows/song_lengths.num_rows))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbc42685-2c35-48f4-9fa1-c2a5047439bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Chebychev's bounds guarantee this number is at least {:.2f}%\".format((1 - 1/3**2) * 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2298ba01-c9c1-443b-9580-f7763ba1461c",
   "metadata": {},
   "outputs": [],
   "source": [
    "song_lengths.hist('Length (Standard Units)', bins=np.arange(-4, 7, 0.5))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}