{ "cells": [ { "cell_type": "markdown", "id": "80afea6a-dc8f-4397-a096-931a665157d6", "metadata": {}, "source": [ "# Chapter 12.2 - Causality" ] }, { "cell_type": "code", "execution_count": null, "id": "e015d95b-51d0-4b16-a048-e241b3e47acf", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plots" ] }, { "cell_type": "markdown", "id": "53a2905a-42fa-49c6-9fe7-c4c99cabfce1", "metadata": {}, "source": [ "Suppose 16 students take CSCI 291 and are randomly split into two groups, A and B,\n", "that are taught differently. Students in group A receive enhanced educational\n", "opportunities to work with the material. Thus, students in group A are the\n", "**treatment group** and students in group B are the **control group**.\n", "The table below shows the grades earned by the students." ] }, { "cell_type": "code", "execution_count": null, "id": "0e81ff54-34e7-47c3-8c82-c5538df9a2b8", "metadata": {}, "outputs": [], "source": [ "grades = Table().with_columns(\n", "    \"Group\", make_array(\"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"B\", \"B\", \"B\", \"B\", \"B\", \"B\", \"B\"),\n", "    \"Grade\", make_array(3.3, 4.0, 3.7, 3.7, 4.0, 3.3, 3.0, 3.7, 4.0, 2.0, 3.7, 2.3, 3.0, 2.7, 2.7, 3.0)\n", ")\n", "grades" ] }, { "cell_type": "code", "execution_count": null, "id": "eeca95fa-6450-4042-8923-93b3aaa41550", "metadata": { "scrolled": true }, "outputs": [], "source": [ "grades.group(\"Group\", np.average)" ] }, { "cell_type": "markdown", "id": "d798fbfc-9cbc-432f-8e44-64e05fd046b9", "metadata": {}, "source": [ "The treatment group's average is higher, but how likely is it that a difference this large is due to chance alone?"
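, "\n", "\n", "As a hand check on the grouped table above, the two averages can be worked out directly from the values in `grades`. The figures are rounded to two decimals, and the symbols $\\bar{x}_A$ and $\\bar{x}_B$ are notation assumed here for the group A and group B averages:\n", "\n", "$$\\bar{x}_A = \\frac{32.7}{9} \\approx 3.63, \\qquad \\bar{x}_B = \\frac{19.4}{7} \\approx 2.77, \\qquad \\bar{x}_A - \\bar{x}_B \\approx 0.86$$"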
] }, { "cell_type": "code", "execution_count": null, "id": "26549525-7d81-4aba-9169-878c355e4c96", "metadata": {}, "outputs": [], "source": [ "observed_outcomes = Table().with_columns(\n", " \"Group\", make_array(\"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \n", " \"Control\", \"Control\", \"Control\", \"Control\", \"Control\", \"Control\", \"Control\"),\n", " \"In Treatment\", make_array(3.3, 4.0, 3.7, 3.7, 4.0, 3.3, 3.0, 3.7, 4.0,\n", " \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\"),\n", " \"In Control\", make_array(\"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\",\n", " 2.0, 3.7, 2.3, 3.0, 2.7, 2.7, 3.0)\n", ")\n", "observed_outcomes.show(16)" ] }, { "cell_type": "markdown", "id": "1984891b-a8e0-481d-8672-0e90ad5f08bc", "metadata": {}, "source": [ "- **Null hypothesis**: The distribution of the treatment outcomes is the same as that of the control outcomes. \n", "The different educational experience makes no difference; the difference in the two samples is just due to chance.\n", "- **Alternative hypothesis**: The distribution of the treatment outcomes is different from that of the control outcomes. \n", "The treatment does something different from the control." 
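, "\n", "\n", "One way to write down what the permutation test in this notebook computes, as a sketch rather than a formal definition. The symbols are notation assumed here: $\\bar{x}_T$ and $\\bar{x}_C$ are the treatment and control averages, $d_{\\text{obs}}$ is the statistic for the observed labels, and $d_1, \\dots, d_N$ are the statistics from $N$ random shufflings of the labels:\n", "\n", "$$d = \\lvert \\bar{x}_T - \\bar{x}_C \\rvert, \\qquad \\hat{p} = \\frac{\\#\\{\\, i : d_i \\ge d_{\\text{obs}} \\,\\}}{N}$$\n", "\n", "A small $\\hat{p}$ means that label shufflings rarely produce a gap as large as the observed one, which counts as evidence against the null hypothesis."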
] }, { "cell_type": "code", "execution_count": null, "id": "06504939-f695-481c-b630-78b82bb86385", "metadata": {}, "outputs": [], "source": [ "def distance(table, group_label, column_label):\n", "    # Average each column within the two groups, then take the absolute\n", "    # difference between the two group averages found in column_label.\n", "    gpa = table.group(group_label, np.average).column(column_label)\n", "    gpa_distance = abs(gpa[0] - gpa[1])\n", "    return gpa_distance" ] }, { "cell_type": "code", "execution_count": null, "id": "6ab3bdc9-d918-4cd4-85e4-8a3b7e451cb9", "metadata": {}, "outputs": [], "source": [ "distance(grades, \"Group\", \"Grade average\")" ] }, { "cell_type": "markdown", "id": "114dcd10-bc76-4ba4-954e-f0a4459df6fb", "metadata": {}, "source": [ "The **test statistic** is the absolute difference between the two groups' average grades.\n", "To simulate this statistic under the null hypothesis, we randomly permute the Group labels:\n", "if the treatment makes no difference, every assignment of labels to grades is equally likely." ] }, { "cell_type": "code", "execution_count": null, "id": "fc073d4c-1fa9-4d48-9a05-09201533cce2", "metadata": {}, "outputs": [], "source": [ "def one_simulated_difference():\n", "    # Sampling every row without replacement shuffles the Group labels.\n", "    shuffled_labels = grades.sample(with_replacement = False).column('Group')\n", "    original_and_shuffled = grades.with_column('Shuffled Label', shuffled_labels)\n", "    return distance(original_and_shuffled, \"Shuffled Label\", \"Grade average\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d27ca811-f147-4c5a-b02a-342a2bbc3ac1", "metadata": {}, "outputs": [], "source": [ "one_simulated_difference()" ] }, { "cell_type": "code", "execution_count": null, "id": "9f5b3e43-e91a-45b6-9ad0-6f36e48dbd00", "metadata": {}, "outputs": [], "source": [ "def many_simulated_differences(how_many):\n", "    # Collect the test statistic from how_many independent label shuffles.\n", "    differences = make_array()\n", "    for i in np.arange(how_many):\n", "        new_difference = one_simulated_difference()\n", "        differences = np.append(differences, new_difference)\n", "    return differences" ] }, { "cell_type": "code", "execution_count": null, "id": "0e0cd01c-7172-400f-bb12-ddabf14c831e", "metadata": {}, "outputs": [], "source": [ "x = many_simulated_differences(10)\n", "x" ] }, {
"cell_type": "code", "execution_count": null, "id": "c9ef82e9-76c6-4763-a943-09b0eec5e560", "metadata": {}, "outputs": [], "source": [ "observed_distance = distance(grades, \"Group\", \"Grade average\")\n", "repetitions = 250\n", "differences = many_simulated_differences(repetitions)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a6f08c6-4c15-4386-bed3-7b01e7e65151", "metadata": {}, "outputs": [], "source": [ "Table().with_column('Difference Between Group Means', differences).hist(bins = np.arange(-0.05, 1.05, .1))\n", "plots.scatter(observed_distance, 0, color='red')\n", "plots.title('Prediction Under the Null Hypothesis')\n", "print('Observed Difference:', observed_distance)" ] }, { "cell_type": "code", "execution_count": null, "id": "4edad11b-11e8-4386-863b-8535a14dc71e", "metadata": {}, "outputs": [], "source": [ "empirical_p = np.count_nonzero(differences >= observed_distance) / repetitions\n", "empirical_p" ] }, { "cell_type": "markdown", "id": "f3843118-f5a0-4233-bfc6-406fbdd6b121", "metadata": {}, "source": [ "### Conclusions ###\n", "- The result is statistically significant. The test favors the alternative hypothesis over the null. \n", "The evidence supports the hypothesis that the treatment is doing something.\n", "- Because the trials were randomized, the test is evidence that the treatment *causes* the difference. \n", "The random assignment of students to the two groups ensures that there is no confounding variable that could affect the conclusion of causality.\n", "- If the treatment had not been randomly assigned, our test would still point toward an association between the treatment and \n", "the educational outcomes among the students." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }