{ "cells": [ { "cell_type": "markdown", "id": "1b0af86b-e154-42df-a2d2-f04205828112", "metadata": {}, "source": [ "# Chapter 15 - Prediction" ] }, { "cell_type": "markdown", "id": "aa44132e-795b-4563-837e-237a4dd661e0", "metadata": {}, "source": [ "## Chapter 15.3 - The Method of Least Squares" ] }, { "cell_type": "code", "execution_count": null, "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots" ] }, { "cell_type": "code", "execution_count": null, "id": "aee0c8ef-e8da-4f9f-b691-ead4c9da9eb8", "metadata": {}, "outputs": [], "source": [ "# Generate somewhat random data for the equation y = 2.5x - 50\n", "def generate_table(number_items):\n", "    result = Table(make_array(\"x\", \"y\"))\n", "    for _ in range(number_items):\n", "        # x is uniform on [0, 100); delta is uniform noise on [-10, 10) added to x\n", "        x = np.random.random() * 100\n", "        delta = 20 * np.random.random() - 10\n", "        y = 2.5 * (x + delta) - 50\n", "        result = result.with_row(make_array(x, y))\n", "    return result" ] }, { "cell_type": "code", "execution_count": null, "id": "c07f4a89-a070-48fc-993b-661e14f48979", "metadata": {}, "outputs": [], "source": [ "data = generate_table(100)\n", "data.scatter(\"x\")\n", "# Red line: the noise-free equation y = 2.5x - 50\n", "plots.plot([0, 100], [-50, 200], color='red', lw=2);" ] }, { "cell_type": "code", "execution_count": null, "id": "36f387a4-64fd-439d-92f0-4baa3946cdf1", "metadata": {}, "outputs": [], "source": [ "def standard_units(numbers):\n", "    \"Convert any array of numbers to standard units.\"\n", "    return (numbers - np.mean(numbers))/np.std(numbers)" ] }, { "cell_type": "code", "execution_count": null, "id": "559cf34f-46ce-40cd-9a50-0c140afb6167", "metadata": {}, "outputs": [], "source": [ "standardized_data = Table().with_columns(\n", "    'x (standard units)', standard_units(data.column('x')),\n", "    'y (standard units)', standard_units(data.column('y'))\n", ")\n", "standardized_data.scatter(0, 1)\n", "plots.xlim(-3, 3)\n", 
"plots.ylim(-3, 3);" ] }, { "cell_type": "code", "execution_count": null, "id": "f092b11d-0f1a-4030-a90b-3c0a809caa19", "metadata": {}, "outputs": [], "source": [ "# Calculate r\n", "def correlation(t, label_x, label_y):\n", "    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))\n", "\n", "# Calculate m of y = mx + b\n", "def slope(t, label_x, label_y):\n", "    r = correlation(t, label_x, label_y)\n", "    return r*np.std(t.column(label_y))/np.std(t.column(label_x))\n", "\n", "# Calculate b of y = mx + b\n", "def intercept(t, label_x, label_y):\n", "    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))" ] }, { "cell_type": "code", "execution_count": null, "id": "35fed660-abeb-4e88-8b84-f1c8fe5d4b37", "metadata": {}, "outputs": [], "source": [ "r = correlation(data, \"x\", \"y\")\n", "m = slope(data, \"x\", \"y\")\n", "b = intercept(data, \"x\", \"y\")\n", "print(\"r =\", r)\n", "print(\"m =\", m)\n", "print(\"b =\", b)" ] }, { "cell_type": "markdown", "id": "f1942276-1355-4582-af97-547dc9076498", "metadata": {}, "source": [ "Question - What is a reasonable way to calculate the error of our candidate line, y = mx + b?  \n", "Answer - Calculate the **root mean squared error**." 
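, "\n", "\n", "As a sketch in the notation above (with *n*, the number of data points, and $(x_i, y_i)$, the individual points, introduced here for illustration), the root mean squared error of the line y = mx + b is\n", "\n", "$$\\text{RMSE}(m, b) = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} \\bigl(y_i - (m x_i + b)\\bigr)^2}$$\n", "\n", "Each term is the squared vertical distance between a point and the line, so a smaller RMSE means the line passes closer to the data." 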
] }, { "cell_type": "code", "execution_count": null, "id": "aad72f9f-39a0-496d-9f13-79a4a97cdafa", "metadata": {}, "outputs": [], "source": [ "# A general function to compute the Root Mean Squared Error\n", "def rmse_general(table, x_label, y_label, slope, intercept):\n", "    x = table.column(x_label)\n", "    y = table.column(y_label)\n", "    fitted = slope * x + intercept\n", "    mse = np.mean((y - fitted) ** 2)\n", "    return mse ** 0.5" ] }, { "cell_type": "code", "execution_count": null, "id": "02af2ef4-270d-48a0-8c17-8fb406dc3656", "metadata": {}, "outputs": [], "source": [ "print(\"Root mean squared error =\", rmse_general(data, \"x\", \"y\", m, b))" ] }, { "cell_type": "markdown", "id": "b9d9b4ef-2134-406e-bac9-9d884d6a54c8", "metadata": {}, "source": [ "Useful, Provable Fact: **The regression line is the unique straight line that minimizes the mean squared error of \n", "estimation among all straight lines!**" ] }, { "cell_type": "code", "execution_count": null, "id": "1e19ff87-83fb-453e-b770-17a8a2af8c58", "metadata": {}, "outputs": [], "source": [ "# A less general function to compute the Root Mean Squared Error: the data table is fixed,\n", "# so the only arguments are the ones that minimize (from the datascience library) should vary\n", "def rmse(slope, intercept):\n", "    x = data.column(\"x\")\n", "    y = data.column(\"y\")\n", "    fitted = slope * x + intercept\n", "    mse = np.mean((y - fitted) ** 2)\n", "    return mse ** 0.5" ] }, { "cell_type": "code", "execution_count": null, "id": "e0faee0a-d91a-41f3-adf2-393b5d436c95", "metadata": {}, "outputs": [], "source": [ "minimize(rmse)" ] }, { "cell_type": "markdown", "id": "170fb812-c059-49c8-b26f-2cb4152331bf", "metadata": {}, "source": [ "The values returned by *minimize* are approximately the same slope and intercept that we calculated using the functions above." 
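, "\n", "\n", "This agreement is expected. In the notation of the functions above (with $\\bar{x}$, $\\bar{y}$ for the means and $\\mathrm{SD}_x$, $\\mathrm{SD}_y$ for the standard deviations, symbols introduced here for illustration), the regression line has\n", "\n", "$$m = r \\cdot \\frac{\\mathrm{SD}_y}{\\mathrm{SD}_x}, \\qquad b = \\bar{y} - m\\,\\bar{x}$$\n", "\n", "so a numerical minimizer of the RMSE should land on approximately these two values." 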
] }, { "cell_type": "markdown", "id": "a3dbc5f9-d266-422d-943e-d468589eb2ea", "metadata": {}, "source": [ "## Chapter 15.4 - Least Squares Regression" ] }, { "cell_type": "markdown", "id": "855d497c-8482-4caf-8bea-b2090f00d1e7", "metadata": {}, "source": [ "Observations  \n", "- Even if the data is not linear, there is a unique line that minimizes the mean squared error of estimation.  \n", "This line can be identified using the technique of 15.3.  \n", "- The *minimize* function is not limited to straight lines: it can be applied to other numerical functions of one or more arguments, such as the root mean squared error of a quadratic fit." ] }, { "cell_type": "code", "execution_count": null, "id": "d34cceb3-fefa-4606-bd43-0f221f689331", "metadata": {}, "outputs": [], "source": [ "# y = -2x^2 + 7x - 3\n", "x = np.arange(10, 20, .3)\n", "y = make_array()\n", "for x_value in x:\n", "    y = np.append(y, -2*x_value*x_value + 7*x_value - 3)\n", "quadratic = Table().with_columns(\n", "    \"x\", x,\n", "    \"y\", y\n", ")\n", "quadratic.show(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "3fb30963-be9c-4cc0-a6e8-d8b41da455da", "metadata": {}, "outputs": [], "source": [ "# y = ax^2 + bx + c\n", "def rmse_quadratic_example(a, b, c):\n", "    x = quadratic.column(\"x\")\n", "    y = quadratic.column(\"y\")\n", "    fitted = a*x*x + b*x + c\n", "    mse = np.mean((y - fitted) ** 2)\n", "    return mse ** 0.5" ] }, { "cell_type": "code", "execution_count": null, "id": "00bc925a-feba-4e5a-9920-cad2aabb4313", "metadata": {}, "outputs": [], "source": [ "minimize(rmse_quadratic_example)" ] }, { "cell_type": "markdown", "id": "73eab30b-98e9-4004-924e-800cf835dc0e", "metadata": {}, "source": [ "**Active Learning** (using the *grades_and_piazza.csv* file from the Chapter 8 materials):\n", "- Calculate *r* (the correlation coefficient) between GPA and each of the other six data items\n", "- For the data item whose *r* has the largest absolute value, use *minimize* to identify the linear equation that minimizes the root mean squared error" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", 
"name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }