{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7c2801b6-fd79-40fa-a514-8db9349bfc5a",
   "metadata": {},
   "source": [
    "# Chapter 15 - Prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee0c8ef-e8da-4f9f-b691-ead4c9da9eb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate somewhat random data for the equation y = -3x + 250\n",
    "def generate_table(number_items):\n",
    "    result = Table(make_array(\"x\", \"y\"))\n",
    "    for _ in range(number_items):\n",
    "        x = np.random.random() * 100\n",
    "        delta = 50 * np.random.random() - 25\n",
    "        y = -3 * (x + delta) + 250\n",
    "        result = result.with_row(make_array(x,y))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c07f4a89-a070-48fc-993b-661e14f48979",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = generate_table(100)\n",
    "data.scatter(\"x\")\n",
    "plots.plot([0, 100], [250, -50], color='red', lw=2);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bcfc468-1584-4c90-990d-eceeda72dc5b",
   "metadata": {},
   "source": [
    "## 15.1 - Correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fb8e881-be9d-4885-bf27-f078d82b2631",
   "metadata": {},
   "source": [
    "**Linear Assocation** - how tightly clustered a scatter diagram is about a straight line.\n",
    "Do x and y in the graph above display a *positive* or *negative* association?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36f387a4-64fd-439d-92f0-4baa3946cdf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recall from Chapter 14 ...\n",
    "def standard_units(numbers):\n",
    "    \"Convert any array of numbers to standard units.\"\n",
    "    return (numbers - np.mean(numbers)) / np.std(numbers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "559cf34f-46ce-40cd-9a50-0c140afb6167",
   "metadata": {},
   "outputs": [],
   "source": [
    "standardized_data = Table().with_columns(\n",
    "    'x (standard units)',  standard_units(data.column('x')), \n",
    "    'y (standard units)', standard_units(data.column('y'))\n",
    ")\n",
    "standardized_data.scatter(0, 1)\n",
    "plots.xlim(-3, 3)\n",
    "plots.ylim(-3, 3);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "289833bb-d381-4029-a027-226255c1abf3",
   "metadata": {},
   "source": [
    "*Correlation Coefficient* or **r** - measures the strength of the linear relationship between two variables.\n",
    "- The correlation coefficient is a number between -1 and 1.\n",
    "- **r** measures the extent to which the scatter plot clusters around a straight line.\n",
    "- If **r** == 1, the points lie exactly on a straight line sloping upwards, and if **r** == -1 they lie exactly on a straight line sloping downwards. When both variables are in standard units, these lines are at 45 degrees.\n",
    "- **r** is the average of the products of the two variables, when both variables are measured in standard units."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4da86f2d-d5aa-4d52-a820-d38bc8cdf6c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "r = np.mean(standardized_data.column(\"x (standard units)\") * standardized_data.column(\"y (standard units)\"))\n",
    "print(\"r =\", r)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e894498-06be-408b-bd0f-ac5ed05c1eb3",
   "metadata": {},
   "source": [
    "What would r be if *delta* in *generate_table* above were always 0?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f092b11d-0f1a-4030-a90b-3c0a809caa19",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's make a function to calculate r from non-standardized data\n",
    "def correlation(table, label_1, label_2):\n",
    "    return np.mean(standard_units(table.column(label_1))*standard_units(table.column(label_2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35fed660-abeb-4e88-8b84-f1c8fe5d4b37",
   "metadata": {},
   "outputs": [],
   "source": [
    "r = correlation(data, \"x\", \"y\")\n",
    "print(\"r =\", r)"
   ]
  },
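  {
   "cell_type": "markdown",
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e01",
   "metadata": {},
   "source": [
    "As a sanity check, the cell below is a minimal sketch that compares the standard-units definition of **r** with NumPy's built-in `np.corrcoef` on a small synthetic data set. The helper name `r_from_definition` and the demo arrays are illustrative assumptions, not part of the chapter. The two values should agree up to floating-point error, and rescaling either variable by a positive constant (a change of units) should leave **r** unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical helper: r as the mean of the products in standard units\n",
    "def r_from_definition(x, y):\n",
    "    x_su = (x - np.mean(x)) / np.std(x)\n",
    "    y_su = (y - np.mean(y)) / np.std(y)\n",
    "    return np.mean(x_su * y_su)\n",
    "\n",
    "# Synthetic demo data (an assumption for illustration): a noisy line\n",
    "rng = np.random.default_rng(0)\n",
    "x_demo = rng.uniform(0, 100, 200)\n",
    "y_demo = -3 * x_demo + 250 + rng.normal(0, 40, 200)\n",
    "\n",
    "# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r\n",
    "print(\"definition:    \", r_from_definition(x_demo, y_demo))\n",
    "print(\"np.corrcoef:   \", np.corrcoef(x_demo, y_demo)[0, 1])\n",
    "\n",
    "# Changing units by positive scale factors does not change r\n",
    "print(\"changed units: \", r_from_definition(x_demo / 2.54, y_demo * 1000))"
   ]
  },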
  {
   "cell_type": "markdown",
   "id": "65e062d8-303b-48fb-928f-c0d59647ddce",
   "metadata": {},
   "source": [
    "A few caveats:\n",
    "- Correlation measures association, not causation.\n",
    "- Correlation only measures linear association; a strong nonlinear relationship can still have **r** near 0.\n",
    "- Outliers can significantly impact correlation."
   ]
  },
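  {
   "cell_type": "markdown",
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e03",
   "metadata": {},
   "source": [
    "The last two caveats can be illustrated with a rough sketch on made-up data (the arrays and the helper `r_of` below are assumptions for illustration only). In the first case y is completely determined by x, yet the relationship is not linear and **r** comes out at essentially 0. In the second, a single point far from an otherwise perfect line pulls **r** well away from 1, and with these particular numbers it even turns negative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical helper: mean of the products in standard units\n",
    "def r_of(x, y):\n",
    "    x_su = (x - np.mean(x)) / np.std(x)\n",
    "    y_su = (y - np.mean(y)) / np.std(y)\n",
    "    return np.mean(x_su * y_su)\n",
    "\n",
    "# Nonlinear association: a parabola that is symmetric about x = 0\n",
    "x_curve = np.arange(-5, 6)\n",
    "y_curve = x_curve ** 2\n",
    "print(\"parabola:       r =\", r_of(x_curve, y_curve))\n",
    "\n",
    "# Ten points on a perfect line, then the same points plus one outlier\n",
    "x_line = np.arange(10)\n",
    "y_line = 2 * x_line + 1\n",
    "print(\"line:           r =\", r_of(x_line, y_line))\n",
    "print(\"line + outlier: r =\", r_of(np.append(x_line, 20), np.append(y_line, -50)))"
   ]
  },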
  {
   "cell_type": "markdown",
   "id": "abecab52-e58b-48a4-88ef-0d8fccae7a36",
   "metadata": {},
   "source": [
    "## 15.2 - The Regression Line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a356227-9c78-4d2b-a2ee-ea295834b971",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate the slope of the regression line, y = mx + b\n",
    "m = r * (np.std(data.column(\"y\")) / np.std(data.column(\"x\")))\n",
    "print(\"m =\", m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06d69f3e-86d7-4aa5-bc56-282a83c33744",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate the intercept of the regression line, y = mx + b\n",
    "b = np.mean(data.column(\"y\")) - m*np.mean(data.column(\"x\"))\n",
    "print(\"b =\", b)"
   ]
  },
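  {
   "cell_type": "markdown",
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e05",
   "metadata": {},
   "source": [
    "One way to check the slope and intercept is to compare them with a least-squares fit from NumPy. The cell below is a sketch that assumes `data`, `m`, `b`, and `plots` from the cells above are already defined; the names `polyfit_slope` and `polyfit_intercept` are introduced here for illustration. `np.polyfit` with degree 1 should return nearly the same two numbers, and the fitted line can be drawn over the scatter plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e0d1a52-7b3c-4f8e-9c21-3a6f0b9d1e06",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Assumes `data`, `m`, `b`, and `plots` were defined in earlier cells\n",
    "x_values = data.column(\"x\")\n",
    "y_values = data.column(\"y\")\n",
    "\n",
    "# For degree 1, np.polyfit returns [slope, intercept] of the least-squares line\n",
    "polyfit_slope, polyfit_intercept = np.polyfit(x_values, y_values, 1)\n",
    "print(\"m =\", m, \" polyfit slope =\", polyfit_slope)\n",
    "print(\"b =\", b, \" polyfit intercept =\", polyfit_intercept)\n",
    "\n",
    "# Draw the fitted line y = mx + b over the scatter plot for x in [0, 100]\n",
    "data.scatter(\"x\")\n",
    "plots.plot([0, 100], [b, m * 100 + b], color=\"red\", lw=2);"
   ]
  },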
  {
   "cell_type": "markdown",
   "id": "164a35b1-3abe-4715-9ede-071a344063dd",
   "metadata": {},
   "source": [
    "Thus, the equation of the regression line is approximately\n",
    "\n",
    "        y = mx + b (plug in the m and b values from above)\n",
    "\n",
    "Because the data includes random noise, this will not match exactly, but it should be close to the equation used to generate the data\n",
    "\n",
    "        y = -3x + 250"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}