{ "cells": [ { "cell_type": "markdown", "id": "77c7aedf-974a-48bd-acae-fba0b60356a4", "metadata": {}, "source": [ "# Chapter 13.2 - The Bootstrap" ] }, { "cell_type": "code", "execution_count": null, "id": "d5516087-9b25-409f-94d5-f11dd0c167c0", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "id": "324be9b7-0e03-4db1-9928-17de96c0a99a", "metadata": {}, "outputs": [], "source": [ "# Place the csv file in the same directory as this notebook\n", "ski_resorts = Table.read_table(\"ski_resorts.csv\")\n", "ski_resorts.show(5)" ] }, { "cell_type": "markdown", "id": "73dea73e-07b0-428e-9784-5beff81267b4", "metadata": {}, "source": [ "Suppose that a data scientist only has access to one sample that contains 200 random entries from this data set and\n", "needs to estimate the median **Total Snowfall** across all seasons and all resorts." ] }, { "cell_type": "markdown", "id": "813ec16d-fc50-47b0-ba26-d70897af4874", "metadata": {}, "source": [ "By how much could such an estimate vary from one sample to another? To answer this, it appears as though she needs access to other samples of 200 random entries\n", "from the population. Unfortunately, she only has access to this one random sample. Is she stuck?" ] }, { "cell_type": "markdown", "id": "ba545be2-e64a-4350-bed4-c68cf8a79b8c", "metadata": {}, "source": [ "Fortunately, no: an idea called **the bootstrap** can help. Since it is not feasible to generate new samples from the population,\n", "the bootstrap generates new random samples by a method called **resampling**, which draws new samples at random from the *original sample*." 
] }, { "cell_type": "code", "execution_count": null, "id": "3501b588-5066-4da4-bcd7-1e5e9b385082", "metadata": {}, "outputs": [], "source": [ "ski_resorts = ski_resorts.sort(\"Total Snowfall\", descending=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "b5ab39e5-3457-4944-a8db-ae60ba6dc37e", "metadata": {}, "outputs": [], "source": [ "ski_resorts.take(np.arange(3))" ] }, { "cell_type": "code", "execution_count": null, "id": "aa763892-3f7d-4f8f-a7c6-b6fae3d3649e", "metadata": {}, "outputs": [], "source": [ "ski_resorts.take(np.arange(ski_resorts.num_rows-3, ski_resorts.num_rows))" ] }, { "cell_type": "code", "execution_count": null, "id": "f86e03ca-2383-43d5-98b0-51baa7bb6f54", "metadata": {}, "outputs": [], "source": [ "snow_bins = np.arange(0, 850, 50)\n", "ski_resorts.select(\"Total Snowfall\").hist(bins=snow_bins)" ] }, { "cell_type": "code", "execution_count": null, "id": "888a2377-8cbe-4a94-abc6-2847213ab7fe", "metadata": {}, "outputs": [], "source": [ "# Although we have access to the full data set, assume that the data scientist does not!\n", "total_snowfall_median = percentile(50, ski_resorts.column(\"Total Snowfall\"))\n", "total_snowfall_median" ] }, { "cell_type": "markdown", "id": "e2d7eefe-8f8f-4c64-9639-4958bc8fcc0f", "metadata": {}, "source": [ "The entire data set (the population) contains 875 entries. The data scientist must estimate `total_snowfall_median` from one sample\n", "of size 200. 
Here it is:" ] }, { "cell_type": "code", "execution_count": null, "id": "123ffa2a-1708-4d49-9934-506dc56ea556", "metadata": {}, "outputs": [], "source": [ "resort_sample = ski_resorts.sample(200, with_replacement=False)\n", "resort_sample_median = percentile(50, resort_sample.column(\"Total Snowfall\"))\n", "resort_sample_median" ] }, { "cell_type": "code", "execution_count": null, "id": "295ea2b2-3411-46d0-aeee-530e9a5b4a1e", "metadata": {}, "outputs": [], "source": [ "resort_sample.select(\"Total Snowfall\").hist(bins=snow_bins)" ] }, { "cell_type": "markdown", "id": "ed105a08-7905-43de-baf3-0eca7e5e22b2", "metadata": {}, "source": [ "Question - How can the data scientist know if her sample median is a good estimate if she doesn't have access to the original population?" ] }, { "cell_type": "markdown", "id": "8d8e9980-bf52-484e-b6a7-3121d955dc94", "metadata": {}, "source": [ "Answer - Use the bootstrap method!\n", "- Treat the original sample as if it were the population.\n", "- Draw from the sample, at random **with replacement**, the same number of times as the original sample size." 
] }, { "cell_type": "code", "execution_count": null, "id": "d706f94f-127f-41fc-8464-f78be5be4cb9", "metadata": {}, "outputs": [], "source": [ "def one_bootstrap_median(original_sample):\n", "    resort_resample = original_sample.sample()\n", "    # Equivalent to resort_resample = original_sample.sample(original_sample.num_rows, with_replacement=True)\n", "    median = percentile(50, resort_resample.column(\"Total Snowfall\"))\n", "    return median" ] }, { "cell_type": "code", "execution_count": null, "id": "ea315132-516a-4cfc-a586-3b171d1b7ce3", "metadata": {}, "outputs": [], "source": [ "one_bootstrap_median(resort_sample)" ] }, { "cell_type": "markdown", "id": "ea783375-5e1d-4833-b51c-c9733e523720", "metadata": {}, "source": [ "One bootstrap median is not enough; we need to calculate many of them:" ] }, { "cell_type": "code", "execution_count": null, "id": "8698c46e-3333-46b2-841e-aec8ffa26903", "metadata": {}, "outputs": [], "source": [ "def calculate_medians(original_sample, how_many):\n", "    bootstrap_medians = make_array()\n", "    for _ in range(how_many):\n", "        bootstrap_medians = np.append(bootstrap_medians, one_bootstrap_median(original_sample))\n", "    return bootstrap_medians" ] }, { "cell_type": "code", "execution_count": null, "id": "0f1afe6d-9d23-446c-8215-bbece5d7419c", "metadata": {}, "outputs": [], "source": [ "bootstrap_medians = calculate_medians(resort_sample, 10)\n", "bootstrap_medians" ] }, { "cell_type": "code", "execution_count": null, "id": "0f4b65da-38d5-47a6-935a-0fa67a96278b", "metadata": {}, "outputs": [], "source": [ "bootstrap_medians = calculate_medians(resort_sample, 1000)\n", "resampled_medians = Table().with_column('Bootstrap Sample Median', bootstrap_medians)\n", "median_bins = np.arange(119, 219, 5) # We know that 169 is the true population median; the data scientist does not!\n", "resampled_medians.hist(bins=median_bins)" ] }, { "cell_type": "markdown", "id": "167c58fb-6dec-442d-93bb-3d746d3ffbc3", "metadata": {}, "source": [ "Do the estimates capture the parameter? 
We will say they do if the middle 95% of\n", "the resampled medians contains the true median (169, known to us but not to the data scientist)." ] }, { "cell_type": "code", "execution_count": null, "id": "276991b4-6eb2-4d70-9310-ca60cb00d20d", "metadata": {}, "outputs": [], "source": [ "left = percentile(2.5, bootstrap_medians)\n", "left" ] }, { "cell_type": "code", "execution_count": null, "id": "febbe190-3eb7-4388-aadc-a16211f3cc1f", "metadata": {}, "outputs": [], "source": [ "right = percentile(97.5, bootstrap_medians)\n", "right" ] }, { "cell_type": "markdown", "id": "a0741716-e789-4a84-91e2-df1733492219", "metadata": {}, "source": [ "Is 169 between these two values? If so, was this one trial a fluke? Let's run 100 trials, each starting from a fresh sample of 200, to\n", "see how often these intervals contain the parameter. Bootstrap theory (which is beyond the scope\n", "of this course!) says that this should happen at least 90% of the time, with 95% being typical.\n", "In other words, this process of estimation captures the parameter about 95% of the time." 
] }, { "cell_type": "code", "execution_count": null, "id": "cd85597c-2796-4cc7-944c-e1257eecfd37", "metadata": {}, "outputs": [], "source": [ "# Full simulation to determine how often 169 is contained in the intervals.\n", "# The simulation is a demonstration of the bootstrap.\n", "\n", "trials = 100\n", "left_ends = make_array()\n", "right_ends = make_array()\n", "\n", "for _ in np.arange(trials):\n", "    # Each trial draws a fresh sample of 200 and builds one 95% interval from it.\n", "    original_sample = ski_resorts.sample(200, with_replacement=False)\n", "    # Note: each trial also uses `trials` (100) bootstrap resamples.\n", "    medians = calculate_medians(original_sample, trials)\n", "    left_ends = np.append(left_ends, percentile(2.5, medians))\n", "    right_ends = np.append(right_ends, percentile(97.5, medians))\n", "\n", "intervals = Table().with_columns(\n", "    'Left', left_ends,\n", "    'Right', right_ends\n", ")\n", "\n", "intervals" ] }, { "cell_type": "code", "execution_count": null, "id": "3515236d-5265-47fd-a526-40e68162fd1c", "metadata": {}, "outputs": [], "source": [ "# Count the intervals whose ends lie strictly on either side of the population median.\n", "intervals.where(\n", "    'Left', are.below(total_snowfall_median)).where(\n", "    'Right', are.above(total_snowfall_median)).num_rows" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }