{ "cells": [ { "cell_type": "markdown", "id": "50cd404d-daf6-4f6e-91a5-199d15dcef7a", "metadata": {}, "source": [ "# Chapter 11.3 - Decisions and Uncertainty" ] }, { "cell_type": "code", "execution_count": null, "id": "f09abf82-8015-4b1d-9df3-72330e667f03", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "import matplotlib.pyplot as plots\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "7ea8353b-6caf-4e78-ad2e-ab1abd191580", "metadata": {}, "source": [ "All statistical tests attempt to choose between two views of the world. Specifically, the choice is between two views about how the data were generated. \n", "These two views are called *hypotheses*." ] }, { "cell_type": "markdown", "id": "4fe6a608-65fd-48ea-a587-002a151656da", "metadata": {}, "source": [ "**Null Hypothesis** - the data were generated at random under clearly specified assumptions about the randomness.\n", "It is so called because, if the data look different from what the null hypothesis predicts, the difference is attributed to *nothing* but chance." ] }, { "cell_type": "markdown", "id": "a07221e4-4da5-40be-8495-d26545918a82", "metadata": {}, "source": [ "**Alternative Hypothesis** - some reason other than chance made the data differ from the predictions of the model in the null hypothesis." ] }, { "cell_type": "markdown", "id": "2d3abb78-bd90-464b-ae91-1fd912d1f023", "metadata": {}, "source": [ "**Test Statistic** - the statistic used to choose between the two hypotheses." ] }, { "cell_type": "code", "execution_count": null, "id": "dec85e6e-1d44-4447-a709-b68bf5c65edb", "metadata": {}, "outputs": [], "source": [ "# Gregor Mendel's model is that each pea plant has a 75% (25%) chance of\n", "# producing a purple (white) flower, independent of all other plants.
\n", "# 705 out of 929 pea plants produced purple flowers.\n", "# The test statistic is the distance, in percentage points, between the observed percent purple and 75%.\n", "total_variation_distance = abs(100 * (705 / 929) - 75)\n", "observed_statistic = total_variation_distance\n", "observed_statistic" ] }, { "cell_type": "code", "execution_count": null, "id": "a675e3bb-039f-4630-bfc5-bec8a1046e9a", "metadata": {}, "outputs": [], "source": [ "mendel_proportions = make_array(0.75, 0.25)\n", "mendel_proportion_purple = mendel_proportions.item(0)\n", "sample_size = 929" ] }, { "cell_type": "code", "execution_count": null, "id": "9e99b9cd-fc67-44b1-bdd0-4d354eff78b8", "metadata": {}, "outputs": [], "source": [ "def one_simulated_distance():\n", "    # Simulate one sample under Mendel's model and return the distance,\n", "    # in percentage points, between the sample percent purple and 75%.\n", "    sample_proportion_purple = sample_proportions(sample_size, mendel_proportions).item(0)\n", "    return 100 * abs(sample_proportion_purple - mendel_proportion_purple)" ] }, { "cell_type": "code", "execution_count": null, "id": "f094be5f-cfdb-4ff5-9a21-6014ecf4ede9", "metadata": {}, "outputs": [], "source": [ "one_simulated_distance()" ] }, { "cell_type": "code", "execution_count": null, "id": "b2dfe8b6-9bcd-436f-b507-6ed2277891b2", "metadata": {}, "outputs": [], "source": [ "def multiple_simulated_distances(how_many):\n", "    # Collect how_many simulated distances in an array.\n", "    distances = make_array()\n", "    for i in np.arange(how_many):\n", "        distances = np.append(distances, one_simulated_distance())\n", "    return distances" ] }, { "cell_type": "code", "execution_count": null, "id": "366ab83b-f7cf-4cd9-8207-e72549e0ff2b", "metadata": {}, "outputs": [], "source": [ "distances = multiple_simulated_distances(10000)\n", "distances" ] }, { "cell_type": "code", "execution_count": null, "id": "6b45a6b8-c849-4b14-902f-4597c1af9538", "metadata": {}, "outputs": [], "source": [ "Table().with_column(\n", "    'Distance between Sample % and 75%', distances\n", ").hist()\n", "plots.ylim(-0.02)\n", "plots.title('Prediction Made by the Null Hypothesis')\n", "plots.scatter(observed_statistic, 0, color='red', s=40);" ] }, { "cell_type": "markdown", "id": "df1aeb57-6abe-4c93-b883-cb1f73444743", "metadata": {}, "source": [
"The above graph shows that the data is **consistent with** (as opposed to **rejected by**) the null hypothesis." ] }, { "cell_type": "code", "execution_count": null, "id": "a755ff57-6eeb-4904-8f45-09aad8483db1", "metadata": { "scrolled": true }, "outputs": [], "source": [ "percent = 100 * (np.count_nonzero(distances >= observed_statistic) / 10000)\n", "print(\"There is a {:.2f} percent change that the test statistic is {:.2f} or greater.\".format(percent, observed_statistic))" ] }, { "cell_type": "markdown", "id": "c6f8f000-0bbf-47ff-8f3b-adcb642a083b", "metadata": {}, "source": [ "**p-value (observed significance level) of a test** - the chance, based on the model in the null hypothesis, that the test statistic will be \n", "equal to the observed value in the sample or even further in the direction that supports the alternative.\n", "- If the p-value is less than 5%, it is considered small and the result is called *statistically significant*.\n", "- If the p-value is even smaller – less than 1% – the result is called *highly statistically significant*." ] }, { "cell_type": "markdown", "id": "0f510154-07e3-4f76-9948-dd0a56faceef", "metadata": {}, "source": [ "Errors can occur when (1) the test favors the alternative hypothesis when in fact the null hypothesis is true or\n", "(2) the test favors the null hypothesis when in fact the alternative hypothesis is true." ] }, { "cell_type": "markdown", "id": "5d625692-30a0-4ef2-9345-3f93736f8ea6", "metadata": {}, "source": [ "**Fact** - If you use a p% cutoff for the p-value, and the null hypothesis happens to be true, \n", "then there is about a p% chance that your test will conclude that the alternative is true." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }