{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "80afea6a-dc8f-4397-a096-931a665157d6",
   "metadata": {},
   "source": [
    "# Chapter 12.2 - Causality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e015d95b-51d0-4b16-a048-e241b3e47acf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plots"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53a2905a-42fa-49c6-9fe7-c4c99cabfce1",
   "metadata": {},
   "source": [
    "Suppose 16 students take CSCI 291 and are randomly split into two groups, A and B,\n",
    "that are taught differently.  Students in group A receive enhanced educational\n",
    "opportunities to work with the material, so group A is the **treatment group** and\n",
    "group B is the **control group**.  The table below shows the grades earned by the\n",
    "students: 9 in group A and 7 in group B."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e81ff54-34e7-47c3-8c82-c5538df9a2b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "grades = Table().with_columns(\n",
    "    \"Group\", make_array(\"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"A\", \"B\", \"B\", \"B\", \"B\", \"B\", \"B\", \"B\"),\n",
    "    \"Grade\", make_array(3.3, 4.0, 3.7, 3.7, 4.0, 3.3, 3.0, 3.7, 4.0, 2.0, 3.7, 2.3, 3.0, 2.7, 2.7, 3.0)\n",
    ")\n",
    "grades"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeca95fa-6450-4042-8923-93b3aaa41550",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "grades.group(\"Group\", np.average)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d798fbfc-9cbc-432f-8e44-64e05fd046b9",
   "metadata": {},
   "source": [
    "The treatment group's average grade is higher than the control group's, but could a difference this large arise from chance alone?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26549525-7d81-4aba-9169-878c355e4c96",
   "metadata": {},
   "outputs": [],
   "source": [
    "observed_outcomes = Table().with_columns(\n",
    "    \"Group\", make_array(\"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \"Treatment\", \n",
    "                        \"Control\", \"Control\", \"Control\", \"Control\", \"Control\", \"Control\", \"Control\"),\n",
    "    \"In Treatment\", make_array(3.3, 4.0, 3.7, 3.7, 4.0, 3.3, 3.0, 3.7, 4.0,\n",
    "                               \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\"),\n",
    "    \"In Control\", make_array(\"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\", \"Unknown\",\n",
    "                             2.0, 3.7, 2.3, 3.0, 2.7, 2.7, 3.0)\n",
    ")\n",
    "observed_outcomes.show(16)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1984891b-a8e0-481d-8672-0e90ad5f08bc",
   "metadata": {},
   "source": [
    "- **Null hypothesis**: The distribution of the treatment outcomes is the same as that of the control outcomes. \n",
    "The different educational experience makes no difference; the difference in the two samples is just due to chance.\n",
    "- **Alternative hypothesis**: The distribution of the treatment outcomes is different from that of the control outcomes. \n",
    "The treatment does something different from the control."
   ]
  },
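  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06504939-f695-481c-b630-78b82bb86385",
   "metadata": {},
   "outputs": [],
   "source": [
    "def distance(table, group_label, column_label):\n",
    "    \"\"\"Return the absolute difference between the two group averages.\n",
    "\n",
    "    Assumes group_label takes exactly two values; column_label names the\n",
    "    averaged column produced by group, such as \"Grade average\".\n",
    "    \"\"\"\n",
    "    averages = table.group(group_label, np.average).column(column_label)\n",
    "    return abs(averages[0] - averages[1])"
   ]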
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ab3bdc9-d918-4cd4-85e4-8a3b7e451cb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "distance(grades, \"Group\", \"Grade average\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "114dcd10-bc76-4ba4-954e-f0a4459df6fb",
   "metadata": {},
   "source": [
    "The **test statistic** is the absolute difference between the two groups' average grades.  \n",
    "To simulate the statistic under the null hypothesis, we randomly permute the Group labels: if the treatment makes no difference, every assignment of labels to grades is equally likely."
   ]
  },
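  {
   "cell_type": "markdown",
   "id": "b7d2f4a1-test-statistic-sketch",
   "metadata": {},
   "source": [
    "As a sketch in symbols (the notation $\\bar{x}_A$, $\\bar{x}_B$, and $T$ is introduced here for illustration and is not part of the original notebook), let $\\bar{x}_A$ and $\\bar{x}_B$ be the average grades of groups A and B. The statistic computed by `distance` can then be written as\n",
    "\n",
    "$$T = \\lvert \\bar{x}_A - \\bar{x}_B \\rvert$$\n",
    "\n",
    "For the observed grades, summing the `Grade` column by group gives\n",
    "\n",
    "$$\\bar{x}_A = \\frac{32.7}{9} \\approx 3.633, \\qquad \\bar{x}_B = \\frac{19.4}{7} \\approx 2.771, \\qquad T_{\\text{obs}} \\approx 0.862$$"
   ]
  },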
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc073d4c-1fa9-4d48-9a05-09201533cce2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def one_simulated_difference():\n",
    "    # Shuffle the group labels without replacement (a random permutation)\n",
    "    shuffled_labels = grades.sample(with_replacement = False).column('Group')\n",
    "    # Keep only the grades so that group() averages just the numeric column\n",
    "    shuffled = grades.select('Grade').with_column('Shuffled Label', shuffled_labels)\n",
    "    return distance(shuffled, \"Shuffled Label\", \"Grade average\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d27ca811-f147-4c5a-b02a-342a2bbc3ac1",
   "metadata": {},
   "outputs": [],
   "source": [
    "one_simulated_difference()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f5b3e43-e91a-45b6-9ad0-6f36e48dbd00",
   "metadata": {},
   "outputs": [],
   "source": [
    "def many_simulated_differences(how_many):\n",
    "    differences = make_array()\n",
    "    for i in np.arange(how_many):\n",
    "        new_difference = one_simulated_difference()\n",
    "        differences = np.append(differences, new_difference)\n",
    "    return differences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e0cd01c-7172-400f-bb12-ddabf14c831e",
   "metadata": {},
   "outputs": [],
   "source": [
    "x = many_simulated_differences(10)\n",
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9ef82e9-76c6-4763-a943-09b0eec5e560",
   "metadata": {},
   "outputs": [],
   "source": [
    "observed_distance = distance(grades, \"Group\", \"Grade average\")\n",
    "repetitions = 250\n",
    "differences = many_simulated_differences(repetitions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a6f08c6-4c15-4386-bed3-7b01e7e65151",
   "metadata": {},
   "outputs": [],
   "source": [
    "Table().with_column('Difference Between Group Means', differences).hist(bins = np.arange(-0.05, 1.05, .1))\n",
    "plots.scatter(observed_distance, 0, color='red')\n",
    "plots.title('Prediction Under the Null Hypothesis')\n",
    "print('Observed Difference:', observed_distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4edad11b-11e8-4386-863b-8535a14dc71e",
   "metadata": {},
   "outputs": [],
   "source": [
    "empirical_p = np.count_nonzero(differences >= observed_distance) / repetitions\n",
    "empirical_p"
   ]
  },
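  {
   "cell_type": "markdown",
   "id": "c4e8a2d6-empirical-p-sketch",
   "metadata": {},
   "source": [
    "In symbols (a sketch; $T_{\\text{obs}}$ and $\\hat{p}$ are notation introduced here for illustration), the empirical p-value computed above is the fraction of simulated statistics at least as large as the observed statistic $T_{\\text{obs}}$:\n",
    "\n",
    "$$\\hat{p} = \\frac{\\text{number of simulated statistics} \\ge T_{\\text{obs}}}{\\text{number of repetitions}}$$\n",
    "\n",
    "With `repetitions = 250`, the smallest nonzero value $\\hat{p}$ can take is $1/250 = 0.004$, so a result of 0 means only that none of the 250 shuffles produced a difference as large as the observed one, not that such a difference is impossible under the null hypothesis."
   ]
  },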
  {
   "cell_type": "markdown",
   "id": "f3843118-f5a0-4233-bfc6-406fbdd6b121",
   "metadata": {},
   "source": [
    "### Conclusions ###\n",
    "- If the empirical p-value is below the chosen cutoff (conventionally 5%), the result is statistically significant. The test then favors the alternative hypothesis over the null: \n",
    "the evidence supports the hypothesis that the treatment is doing something.\n",
    "- Because the students were randomly assigned, the test is evidence that the treatment *causes* the difference. \n",
    "Random assignment makes the two groups alike, apart from chance variation, in every respect other than the treatment, which guards against confounding variables that could undermine the conclusion of causality.\n",
    "- If the treatment had not been randomly assigned, the test would still point toward an *association* between the treatment and \n",
    "the educational outcomes, but it would not establish causation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}