{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1b0af86b-e154-42df-a2d2-f04205828112",
   "metadata": {},
   "source": [
    "# Chapter 15 - Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9dcf322-4648-4eaf-a878-f10fc0417247",
   "metadata": {},
   "source": [
    "## Chapter 15.5 - Visual Diagnostics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b22e8371-8a05-42dc-9396-01d50dab6cbf",
   "metadata": {},
   "source": [
    "Previously seen material ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9538feb6-a0fb-4e3f-9a4f-5d45072870b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee0c8ef-e8da-4f9f-b691-ead4c9da9eb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate noisy data scattered around the line y = 2.5x - 50\n",
    "def generate_table(number_items):\n",
    "    result = Table(make_array(\"x\", \"y\"))\n",
    "    for _ in range(number_items):\n",
    "        x = np.random.random() * 100\n",
    "        # Uniform noise in [-10, 10), applied to x before computing y\n",
    "        delta = 20 * np.random.random() - 10\n",
    "        y = 2.5 * (x + delta) - 50\n",
    "        result = result.with_row(make_array(x, y))\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c07f4a89-a070-48fc-993b-661e14f48979",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = generate_table(100)\n",
    "data.scatter(\"x\")\n",
    "data.show(3)\n",
    "# Overlay the noise-free line y = 2.5x - 50 on the scatter plot\n",
    "plots.plot([0, 100], [-50, 200], color=\"red\", lw=2);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f092b11d-0f1a-4030-a90b-3c0a809caa19",
   "metadata": {},
   "outputs": [],
   "source": [
    "def standard_units(numbers):\n",
    "    \"Convert any array of numbers to standard units.\"\n",
    "    return (numbers - np.mean(numbers))/np.std(numbers)\n",
    "\n",
    "def correlation(t, label_x, label_y):\n",
    "    \"Calculate the correlation coefficient r\"\n",
    "    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))\n",
    "\n",
    "def slope(t, label_x, label_y):\n",
    "    \"Calculate m of y = mx + b\"\n",
    "    r = correlation(t, label_x, label_y)\n",
    "    return r*np.std(t.column(label_y))/np.std(t.column(label_x))\n",
    "\n",
    "def intercept(t, label_x, label_y):\n",
    "    \"Calculate b of y = mx + b\"\n",
    "    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))\n",
    "\n",
    "def fit(table, x, y):\n",
    "    \"\"\"Return the height of the regression line at each x value.\"\"\"\n",
    "    a = slope(table, x, y)\n",
    "    b = intercept(table, x, y)\n",
    "    return a * table.column(x) + b"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0b3e5bb-3d6b-4978-bf33-98676a8abfcc",
   "metadata": {},
   "source": [
    "New material ..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eebfed71-0d6c-48a4-891b-799108cbb36b",
   "metadata": {},
   "source": [
    "**residual** = observed_value - regression_estimate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91d83174-389b-4768-8036-0dc0af9d58db",
   "metadata": {},
   "outputs": [],
   "source": [
    "def residual(table, x, y):\n",
    "    return table.column(y) - fit(table, x, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abe0f381-1313-4b31-885e-4d31fbd860a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data.with_columns(\n",
    "    \"Fitted Value\", fit(data, \"x\", \"y\"),\n",
    "    \"Residual\", residual(data, \"x\", \"y\")\n",
    ")\n",
    "data.show(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a10e6992-46b5-4af8-b0d1-a7cb62dd326e",
   "metadata": {},
   "source": [
    "Plotting the (x, Residual) pairs lets us make a visual diagnosis of the quality of the\n",
    "linear regression analysis. **The residual plot of a good regression shows no pattern.\n",
    "The residuals look about the same, above and below the horizontal line at 0, across the range of\n",
    "the predictor variable.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d8e21d1-1d3d-46ab-87c2-fc8a9f637a12",
   "metadata": {},
   "outputs": [],
   "source": [
    "residual_table = Table().with_columns(\n",
    "    \"x\", data.column(\"x\"),\n",
    "    \"residuals\", residual(data, \"x\", \"y\")\n",
    ")\n",
    "residual_table.scatter(\"x\", \"residuals\", color=\"r\")\n",
    "# Draw the horizontal reference line at residual = 0 across the observed x range\n",
    "xlims = make_array(min(data.column(\"x\")), max(data.column(\"x\")))\n",
    "plots.plot(xlims, make_array(0, 0), color=\"darkblue\", lw=4)\n",
    "plots.title(\"Residual Plot\");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc65c219-f5aa-4cf9-85cb-5a574663f20d",
   "metadata": {},
   "source": [
    "**When a residual plot shows a pattern, there may be a non-linear relation between the variables.**  \n",
    "Terminology: *Heteroscedasticity* means that the spread of the residuals is uneven across the range of the predictor variable."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c35e601a-eb62-4986-8680-534a26ff9286",
   "metadata": {},
   "source": [
    "Is there a pattern in the above plot? What does this tell us about the likely relation between x and y?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}