{ "cells": [ { "cell_type": "markdown", "id": "c9a4b8a7-d0b3-4098-a9b8-e9464c454bf8", "metadata": {}, "source": [ "# Chapter 13.4 - Using Confidence Intervals" ] }, { "cell_type": "markdown", "id": "9caa463e-e059-4aba-b3c7-fcdea71619cf", "metadata": {}, "source": [ "## Repeated information from Chapter 13.3, Confidence Intervals:" ] }, { "cell_type": "code", "execution_count": null, "id": "d5516087-9b25-409f-94d5-f11dd0c167c0", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plots" ] }, { "cell_type": "code", "execution_count": null, "id": "324be9b7-0e03-4db1-9928-17de96c0a99a", "metadata": {}, "outputs": [], "source": [ "# Place the csv file in the same directory as this notebook\n", "ski_resorts = Table.read_table(\"ski_resorts.csv\")\n", "ski_resorts.show(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "3501b588-5066-4da4-bcd7-1e5e9b385082", "metadata": {}, "outputs": [], "source": [ "ski_resorts.hist(\"Total Snowfall\")" ] }, { "cell_type": "code", "execution_count": null, "id": "aa763892-3f7d-4f8f-a7c6-b6fae3d3649e", "metadata": {}, "outputs": [], "source": [ "# Assume the 875 entries in our csv file are a random sample from a much larger population\n", "def one_bootstrap_mean():\n", "    # Resample with replacement, keeping the original sample size\n", "    resample = ski_resorts.sample()\n", "    return np.average(resample.column('Total Snowfall'))" ] }, { "cell_type": "code", "execution_count": null, "id": "f86e03ca-2383-43d5-98b0-51baa7bb6f54", "metadata": {}, "outputs": [], "source": [ "# Generate many means from bootstrap samples\n", "def many_bootstrap_means(how_many):\n", "    bootstrap_means = make_array()\n", "    for _ in np.arange(how_many):\n", "        bootstrap_means = np.append(bootstrap_means, one_bootstrap_mean())\n", "    return bootstrap_means" ] }, { "cell_type": "code", "execution_count": null, "id": "888a2377-8cbe-4a94-abc6-2847213ab7fe", "metadata": {}, "outputs": [], "source": [
"# Obtain endpoints of the 95% confidence interval\n", "bootstrap_means = many_bootstrap_means(1000)\n", "left = percentile(2.5, bootstrap_means)\n", "right = percentile(97.5, bootstrap_means)\n", "make_array(left, right)" ] }, { "cell_type": "markdown", "id": "33f0bc3a-0246-4999-90e1-65bca3da8025", "metadata": {}, "source": [ "The two array values are the endpoints of an approximate 95% confidence interval for the population mean of Total Snowfall.\n", "The histogram below shows the bootstrap means, with the interval marked in yellow:" ] }, { "cell_type": "code", "execution_count": null, "id": "123ffa2a-1708-4d49-9934-506dc56ea556", "metadata": {}, "outputs": [], "source": [ "resampled_means = Table().with_column('Bootstrap Sample Mean', bootstrap_means)\n", "resampled_means.hist(bins=20, unit=\"Inches\")\n", "plots.plot([left, right], [0, 0], color='yellow', lw=8);" ] }, { "cell_type": "markdown", "id": "08f8bc23-cb82-4602-a13f-fd059be7592d", "metadata": {}, "source": [ "## An Incorrect Use of a Confidence Interval" ] }, { "cell_type": "markdown", "id": "32ffeda8-da28-47a4-8378-57d160ea40cb", "metadata": {}, "source": [ "A common mistake is to read a confidence interval as a range that contains most of the data.\n", "For example, it is incorrect to conclude that 95% of the ski resorts have a total snowfall \n", "within the interval [left, right] found above. The interval estimates the population *mean*, not the spread of individual resorts. The next cell counts how many resorts actually fall inside it." 
] }, { "cell_type": "code", "execution_count": null, "id": "2be934c8-8d61-4463-82d0-321225cb99a4", "metadata": {}, "outputs": [], "source": [ "low_bound = left\n", "high_bound = right\n", "reduced_ski_resorts = ski_resorts.where(\"Total Snowfall\", are.above_or_equal_to(low_bound))\n", "reduced_ski_resorts = reduced_ski_resorts.where(\"Total Snowfall\", are.below_or_equal_to(high_bound))\n", "print(\"The percentage of ski resorts in this interval = {:.2f}%.\".format(reduced_ski_resorts.num_rows / ski_resorts.num_rows * 100))" ] }, { "cell_type": "markdown", "id": "d909ab75-dd69-430c-a92c-ebec64c636c8", "metadata": {}, "source": [ "## A Correct Use of a Confidence Interval" ] }, { "cell_type": "markdown", "id": "1248fadb-822e-48a9-88c7-52d063fc2973", "metadata": {}, "source": [ "But we can use a confidence interval to test a hypothesis!\n", "- **Null Hypothesis** - The average total snowfall in the population is 100 inches\n", "- **Alternative Hypothesis** - The average total snowfall in the population is not 100 inches\n", "\n", "Because 100 lies outside the 95% confidence interval computed above, we can reject the null hypothesis at the 5% significance level." ] }, { "cell_type": "markdown", "id": "135c7e57-22cb-4bba-8e60-9c44005ff510", "metadata": {}, "source": [ "## Another Correct Use of a Confidence Interval" ] }, { "cell_type": "markdown", "id": "def5efce-5143-4565-92d1-d41ed3c69407", "metadata": {}, "source": [ "Here is another example. Let the **Null Hypothesis** be that, on average, the Average Summit Depth\n", "is no more than 10 inches greater than the Average Base Depth. (Note: these two measurements are *paired*: each resort reports both, so we analyze the per-resort difference.) \n", "To test this hypothesis, we use the bootstrap method to build a 99% confidence interval for the mean difference, and we reject the null hypothesis if the whole interval lies above 10 inches." 
] }, { "cell_type": "code", "execution_count": null, "id": "849762ed-04ec-43bd-be52-510a395d249d", "metadata": {}, "outputs": [], "source": [ "depth_table = ski_resorts.select(\"Average Base Depth\", \"Average Summit Depth\")\n", "depth_table = depth_table.with_column(\"Difference\", \n", "    depth_table.column(\"Average Summit Depth\") - depth_table.column(\"Average Base Depth\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "3cdecd1a-e345-46e3-8632-cec92ff758d7", "metadata": {}, "outputs": [], "source": [ "print(\"The average difference is {:.2f} inches.\".format(np.average(depth_table.column(\"Difference\"))))" ] }, { "cell_type": "code", "execution_count": null, "id": "201187e5-1fc0-400d-8d48-9e6c50c199f7", "metadata": {}, "outputs": [], "source": [ "# Redefine the helper so that it resamples the paired differences\n", "def one_bootstrap_mean():\n", "    resample = depth_table.sample()\n", "    return np.average(resample.column('Difference'))" ] }, { "cell_type": "code", "execution_count": null, "id": "6ab44d34-568d-4c5f-8898-b2b8b3f2d1e5", "metadata": {}, "outputs": [], "source": [ "# Generate many bootstrap means\n", "def many_bootstrap_means(num_repetitions):\n", "    bstrap_means = make_array()\n", "    for _ in np.arange(num_repetitions):\n", "        bstrap_means = np.append(bstrap_means, one_bootstrap_mean())\n", "    return bstrap_means" ] }, { "cell_type": "code", "execution_count": null, "id": "d9172891-7170-420d-9caf-dbc0c311da91", "metadata": {}, "outputs": [], "source": [ "# Get the endpoints of the 99% confidence interval\n", "bstrap_means = many_bootstrap_means(1000)\n", "left = percentile(0.5, bstrap_means)\n", "right = percentile(99.5, bstrap_means)\n", "make_array(left, right)" ] }, { "cell_type": "code", "execution_count": null, "id": "5c24626f-cb37-493e-971c-0a4d10c1f67c", "metadata": {}, "outputs": [], "source": [ "resampled_means = Table().with_columns(\n", "    'Bootstrap Sample Mean', bstrap_means\n", ")\n", "resampled_means.hist()\n", "plots.plot([left, right], [0, 0], color='yellow', lw=8);" ] }, { "cell_type": 
"markdown", "id": "508c31ee-648d-4286-82b8-0a5b6c3c82b9", "metadata": {}, "source": [ "Notes:\n", "- The higher the confidence level we want, the wider the interval becomes.\n", "- We have done better than simply concluding that we can reject the null hypothesis. We have also estimated how big the average difference is. That’s a more useful result than just saying, “It’s not 10 inches or less.”" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }