{ "cells": [ { "cell_type": "markdown", "id": "94ade04b-965e-480e-b290-15d11aba125d", "metadata": {}, "source": [ "# Chapter 11.2 - Multiple Categories" ] }, { "cell_type": "code", "execution_count": null, "id": "ae684e6a-3cbd-4b2a-8586-f0ae35c2ddd2", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "12a5917f-b5cd-4e5a-8560-6f22a02e2808", "metadata": {}, "source": [ "Suppose that there are 3 hypothetical categories of the common household fly. Type A occurs 50% of the time,\n", "Type B occurs 30% of the time and Type C occurs 20% of the time. Consider an entomologist who collects a\n", "sample of 100 common household flies and learns that there are 67 of Type A, 25 of Type B and 8 of Type C.\n", "Let's use data science techniques to determine whether this collection is consistent with a random sample." ] }, { "cell_type": "code", "execution_count": null, "id": "a8242103-8b44-4274-b0b4-2377943d5f3f", "metadata": {}, "outputs": [], "source": [ "flies = Table().with_columns(\n", " 'Type', make_array('A', 'B', 'C'),\n", " 'Nature', make_array(0.5, 0.3, 0.2),\n", " 'Collection', make_array(0.67, 0.25, 0.08)\n", ")\n", "flies" ] }, { "cell_type": "code", "execution_count": null, "id": "157de8f1-fb4b-499e-9809-a2346c500123", "metadata": {}, "outputs": [], "source": [ "flies.barh(\"Type\")" ] }, { "cell_type": "markdown", "id": "3dc9197c-e1f7-44de-9f73-6bf2e851a710", "metadata": {}, "source": [ "Let's create one random sample." 
] }, { "cell_type": "code", "execution_count": null, "id": "795c71bf-97e3-49ea-8185-1ca564abc432", "metadata": {}, "outputs": [], "source": [ "# Draw one random sample of 100 flies using the proportions found in nature\n", "eligible_population = flies.column('Nature')\n", "sample_distribution = sample_proportions(100, eligible_population)\n", "flies = flies.with_column('Random Sample', sample_distribution)\n", "flies" ] }, { "cell_type": "code", "execution_count": null, "id": "f95abf12-3c70-4364-b9bc-75119e3571a3", "metadata": {}, "outputs": [], "source": [ "flies.barh(\"Type\")" ] }, { "cell_type": "markdown", "id": "04f88467-b8ef-4454-a35b-06ccf720c41b", "metadata": {}, "source": [ "How can we calculate the distance between two distributions?" ] }, { "cell_type": "code", "execution_count": null, "id": "49f5c6e6-53fd-4f75-8062-53a19f4db45c", "metadata": {}, "outputs": [], "source": [ "# Difference between the two proportions for each type\n", "flies_with_diffs = flies.with_column(\n", "    'Difference', flies.column('Nature') - flies.column('Collection')\n", ")\n", "flies_with_diffs" ] }, { "cell_type": "code", "execution_count": null, "id": "4be82565-db49-4f32-aeaa-17c0f9363d18", "metadata": {}, "outputs": [], "source": [ "# The differences sum to zero, so take absolute values before adding them up\n", "flies_with_diffs = flies_with_diffs.with_column(\n", "    'Absolute Difference', np.abs(flies_with_diffs.column('Difference'))\n", ")\n", "flies_with_diffs" ] }, { "cell_type": "code", "execution_count": null, "id": "4c349008-bc63-4773-bcce-87665483b897", "metadata": {}, "outputs": [], "source": [ "flies_with_diffs.column('Absolute Difference').sum() / 2" ] }, { "cell_type": "markdown", "id": "47f0871a-4558-4857-a208-7f96f9c91d13", "metadata": {}, "source": [ "The result, 0.17, is the **total variation distance** between the distribution in nature and the distribution\n", "in the collection: half the sum of the absolute differences between corresponding proportions. The positive\n", "and negative differences have the same total size, so dividing by 2 counts that total once.\n", "\n", "The total variation distance can serve as the test statistic when simulating under random selection.\n", "Large values are evidence against random selection." 
] }, { "cell_type": "code", "execution_count": null, "id": "8221a64c-ec9f-49cc-86b9-d43ae3a6ab0a", "metadata": {}, "outputs": [], "source": [ "def total_variation_distance(distribution_1, distribution_2):\n", "    # Half the sum of the absolute differences between the two distributions\n", "    return sum(np.abs(distribution_1 - distribution_2)) / 2" ] }, { "cell_type": "code", "execution_count": null, "id": "2a765c72-8d1b-4808-91d5-41f30e6eef29", "metadata": {}, "outputs": [], "source": [ "total_variation_distance(flies.column('Nature'), flies.column('Collection'))" ] }, { "cell_type": "code", "execution_count": null, "id": "9ab56da9-90e2-4b01-aea4-43df493a15f5", "metadata": {}, "outputs": [], "source": [ "def one_simulated_tvd():\n", "    # Draw 100 flies at random from the distribution in nature,\n", "    # then measure how far the sample's proportions are from that distribution\n", "    sample_distribution = sample_proportions(100, eligible_population)\n", "    return total_variation_distance(sample_distribution, eligible_population)\n", "one_simulated_tvd()" ] }, { "cell_type": "code", "execution_count": null, "id": "d4da7750-b40c-469a-9079-6f401ecd9364", "metadata": {}, "outputs": [], "source": [ "def many_simulated_tvds(how_many):\n", "    # Collect one simulated TVD per repetition\n", "    tvds = make_array()\n", "    for i in np.arange(how_many):\n", "        tvds = np.append(tvds, one_simulated_tvd())\n", "    return tvds\n", "many_simulated_tvds(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "8e7c0219-ae96-4c50-8d56-84555a22e67b", "metadata": {}, "outputs": [], "source": [ "tvds = many_simulated_tvds(100000)" ] }, { "cell_type": "code", "execution_count": null, "id": "d2df7d45-3a02-4160-832e-9e7bf29e4241", "metadata": {}, "outputs": [], "source": [ "Table().with_column('TVD', tvds).hist(bins=np.arange(0, 0.2, 0.01))" ] }, { "cell_type": "markdown", "id": "b7306b52-8ee8-46fa-971f-b463e61a4e8b", "metadata": {}, "source": [ "The observed total variation distance of 0.17 lies far in the right tail of the simulated distribution:\n", "very few of the simulated values are that large. The composition of the common household fly collection\n", "is therefore not consistent with the model of random selection." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }