{ "cells": [
 { "cell_type": "markdown", "id": "7c2801b6-fd79-40fa-a514-8db9349bfc5a", "metadata": {}, "source": [ "# Chapter 15 - Prediction" ] },
 { "cell_type": "code", "execution_count": null, "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots" ] },
 { "cell_type": "code", "execution_count": null, "id": "aee0c8ef-e8da-4f9f-b691-ead4c9da9eb8", "metadata": {}, "outputs": [], "source": [ "# Generate noisy data scattered around the line y = -3x + 250\n", "def generate_table(number_items):\n", "    result = Table(make_array(\"x\", \"y\"))\n", "    for _ in range(number_items):\n", "        x = np.random.random() * 100\n", "        # delta is uniform noise in [-25, 25) added to x before applying the line\n", "        delta = 50 * np.random.random() - 25\n", "        y = -3 * (x + delta) + 250\n", "        result = result.with_row(make_array(x, y))\n", "    return result" ] },
 { "cell_type": "code", "execution_count": null, "id": "c07f4a89-a070-48fc-993b-661e14f48979", "metadata": {}, "outputs": [], "source": [ "data = generate_table(100)\n", "data.scatter(\"x\")\n", "# Overlay the noise-free line y = -3x + 250 in red\n", "plots.plot([0, 100], [250, -50], color='red', lw=2);" ] },
 { "cell_type": "markdown", "id": "3bcfc468-1584-4c90-990d-eceeda72dc5b", "metadata": {}, "source": [ "## 15.1 - Correlation" ] },
 { "cell_type": "markdown", "id": "3fb8e881-be9d-4885-bf27-f078d82b2631", "metadata": {}, "source": [ "**Linear Association** - how tightly clustered a scatter diagram is about a straight line.\n", "\n", "Do x and y in the graph above display a *positive* or *negative* association?"
] },
 { "cell_type": "code", "execution_count": null, "id": "36f387a4-64fd-439d-92f0-4baa3946cdf1", "metadata": {}, "outputs": [], "source": [ "# Recall from Chapter 14 ...\n", "def standard_units(numbers):\n", "    \"Convert any array of numbers to standard units.\"\n", "    return (numbers - np.mean(numbers)) / np.std(numbers)" ] },
 { "cell_type": "code", "execution_count": null, "id": "559cf34f-46ce-40cd-9a50-0c140afb6167", "metadata": {}, "outputs": [], "source": [ "standardized_data = Table().with_columns(\n", "    'x (standard units)', standard_units(data.column('x')),\n", "    'y (standard units)', standard_units(data.column('y'))\n", ")\n", "standardized_data.scatter(0, 1)\n", "plots.xlim(-3, 3)\n", "plots.ylim(-3, 3);" ] },
 { "cell_type": "markdown", "id": "289833bb-d381-4029-a027-226255c1abf3", "metadata": {}, "source": [ "**Correlation Coefficient** or **r** - measures the strength and direction of the linear relationship between two variables.\n", "- The correlation coefficient is a number between -1 and 1.\n", "- **r** measures the extent to which the scatter plot clusters around a straight line.\n", "- If **r** = 1, the points lie exactly on a straight line sloping upwards, and if **r** = -1, they lie exactly on a straight line sloping downwards. In standard units, these are 45 degree lines.\n", "- **r** is the average of the products of the two variables, when both variables are measured in standard units." ] },
 { "cell_type": "code", "execution_count": null, "id": "4da86f2d-d5aa-4d52-a820-d38bc8cdf6c1", "metadata": {}, "outputs": [], "source": [ "r = np.mean(standardized_data.column(\"x (standard units)\") * standardized_data.column(\"y (standard units)\"))\n", "print(\"r =\", r)" ] },
 { "cell_type": "markdown", "id": "4e894498-06be-408b-bd0f-ac5ed05c1eb3", "metadata": {}, "source": [ "What would r be if *delta* in *generate_table* above is always 0?"
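, "\n", "\n",
 "One way to explore this question is to rebuild the data without noise. The sketch below is an illustration only: it assumes plain NumPy arrays instead of a `Table`, and the names `x_exact`, `y_exact`, and `r_exact` are hypothetical, not part of the notebook above.\n", "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "def standard_units(numbers):\n",
 "    # Convert an array of numbers to standard units (same formula as above)\n",
 "    return (numbers - np.mean(numbers)) / np.std(numbers)\n",
 "\n",
 "# Hypothetical noise-free data: the same line as generate_table, with delta fixed at 0\n",
 "x_exact = np.random.random(100) * 100\n",
 "y_exact = -3 * x_exact + 250\n",
 "\n",
 "# r is the mean of the products of the two variables in standard units\n",
 "r_exact = np.mean(standard_units(x_exact) * standard_units(y_exact))\n",
 "print(\"r =\", r_exact)\n",
 "```\n", "\n",
 "Compare the printed value with the **r** computed from the noisy data, and with the bullet above describing points that lie exactly on a downward-sloping line."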
] },
 { "cell_type": "code", "execution_count": null, "id": "f092b11d-0f1a-4030-a90b-3c0a809caa19", "metadata": {}, "outputs": [], "source": [ "# Let's make a function to calculate r from non-standardized data\n", "def correlation(table, label_1, label_2):\n", "    return np.mean(standard_units(table.column(label_1)) * standard_units(table.column(label_2)))" ] },
 { "cell_type": "code", "execution_count": null, "id": "35fed660-abeb-4e88-8b84-f1c8fe5d4b37", "metadata": {}, "outputs": [], "source": [ "r = correlation(data, \"x\", \"y\")\n", "print(\"r =\", r)" ] },
 { "cell_type": "markdown", "id": "65e062d8-303b-48fb-928f-c0d59647ddce", "metadata": {}, "source": [ "A few caveats:\n", "- Correlation measures association, not causation.\n", "- Correlation only measures linear association.\n", "- Outliers can significantly impact correlation." ] },
 { "cell_type": "markdown", "id": "abecab52-e58b-48a4-88ef-0d8fccae7a36", "metadata": {}, "source": [ "## 15.2 - The Regression Line" ] },
 { "cell_type": "code", "execution_count": null, "id": "0a356227-9c78-4d2b-a2ee-ea295834b971", "metadata": {}, "outputs": [], "source": [ "# Calculate the slope of the regression line, y = mx + b\n", "m = r * (np.std(data.column(\"y\")) / np.std(data.column(\"x\")))\n", "print(\"m =\", m)" ] },
 { "cell_type": "code", "execution_count": null, "id": "06d69f3e-86d7-4aa5-bc56-282a83c33744", "metadata": {}, "outputs": [], "source": [ "# Calculate the intercept of the regression line, y = mx + b\n", "b = np.mean(data.column(\"y\")) - m * np.mean(data.column(\"x\"))\n", "print(\"b =\", b)" ] },
 { "cell_type": "markdown", "id": "164a35b1-3abe-4715-9ede-071a344063dd", "metadata": {}, "source": [ "Thus, the equation of the regression line is approximately\n", "\n", "    y = mx + b (plug in the m and b values from above)\n", "\n", "This should be close to the equation that was used to generate the random data:\n", "\n", "    y = -3x + 250" ] }
 ],
 "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }