# Chapter 11.2 - Multiple Categories

In [None]:
from datascience import *
import numpy as np
%matplotlib inline

Suppose there are three hypothetical categories of the common household fly. In nature, Type A occurs 50% of
the time, Type B occurs 30% of the time and Type C occurs 20% of the time. An entomologist collects a
sample of 100 common household flies and finds 67 of Type A, 25 of Type B and 8 of Type C.
Let's use simulation to determine whether this collection is consistent with a random sample from that population.

In [None]:
flies = Table().with_columns(
    'Type', make_array('A', 'B', 'C'),
    'Nature', make_array(0.5, 0.3, 0.2),
    'Collection', make_array(0.67, 0.25, 0.08)
)
flies

In [None]:
flies.barh("Type")

Let's draw one random sample of 100 flies from the distribution found in nature.

In [None]:
eligible_population = flies.column('Nature')
sample_distribution = sample_proportions(100, eligible_population)
flies = flies.with_column('Random Sample', sample_distribution)
flies
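
As a rough sketch of what this sampling step does, the draw can be reproduced with NumPy alone. It assumes that `sample_proportions` behaves like a multinomial draw of 100 flies divided by the sample size. The helper name `draw_sample_proportions` and the fixed seed are illustrative and are not part of the `datascience` library.

```python
import numpy as np

def draw_sample_proportions(sample_size, probabilities, rng):
    """Hypothetical stand-in for sample_proportions (an assumption, not the library code)."""
    # Draw category counts for one sample, then convert counts to proportions
    counts = rng.multinomial(sample_size, probabilities)
    return counts / sample_size

rng = np.random.default_rng(0)       # seed chosen only for reproducibility
nature = np.array([0.5, 0.3, 0.2])   # proportions of Types A, B and C in nature
sample = draw_sample_proportions(100, nature, rng)
print(sample)
```

Each run with a different seed gives a different set of three proportions, and they always sum to 1.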

In [None]:
flies.barh("Type")

How can we measure the distance between two distributions? Start with the difference in each category, take absolute values, then add them up and divide by 2.

In [None]:
flies_with_diffs = flies.with_column(
    'Difference', flies.column('Nature') - flies.column('Collection')
)
flies_with_diffs

In [None]:
flies_with_diffs = flies_with_diffs.with_column(
    'Absolute Difference', np.abs(flies_with_diffs.column('Difference'))
)
flies_with_diffs

In [None]:
flies_with_diffs.column('Absolute Difference').sum() / 2

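The result, 0.17, is the **total variation distance** between the two distributions: half the sum of the
absolute differences between corresponding proportions. We can use the total variation distance as the
statistic to simulate under random selection. Large values are evidence against random selection.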
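
The arithmetic behind 0.17 can be checked with NumPy alone. This is a small worked sketch using the proportions from the table above; the variable names are illustrative. It also shows why the sum is halved: the surplus in one set of categories always equals the deficit in the others, so the full sum counts the same gap twice.

```python
import numpy as np

nature = np.array([0.5, 0.3, 0.2])         # Types A, B, C in nature
collection = np.array([0.67, 0.25, 0.08])  # Types A, B, C in the collection

abs_diffs = np.abs(nature - collection)    # roughly 0.17, 0.05, 0.12
tvd = abs_diffs.sum() / 2
print(round(tvd, 2))                       # prints 0.17

# The collection's surplus (Type A only) matches its total deficit (Types B and C)
surplus = np.clip(collection - nature, 0, None).sum()
deficit = np.clip(nature - collection, 0, None).sum()
```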

In [None]:
def total_variation_distance(distribution_1, distribution_2):
    return sum(np.abs(distribution_1 - distribution_2)) / 2

In [None]:
total_variation_distance(flies.column('Nature'), flies.column('Collection'))

In [None]:
def one_simulated_tvd():
    sample_distribution = sample_proportions(100, eligible_population)
    return total_variation_distance(sample_distribution, eligible_population)
one_simulated_tvd()

In [None]:
def many_simulated_tvds(how_many):
    # Collect one simulated TVD per repetition
    tvds = make_array()
    for i in np.arange(how_many):
        tvds = np.append(tvds, one_simulated_tvd())
    return tvds
many_simulated_tvds(10)

In [None]:
tvds = many_simulated_tvds(100000)

In [None]:
Table().with_column('TVD', tvds).hist(bins=np.arange(0, 0.2, 0.01))

The observed total variation distance of 0.17 lies far out in the right tail of the simulated distribution,
so the composition of the common household fly collection is not consistent with the model of random selection.
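
One way to make this comparison concrete is to count how often a simulated distance is at least as large as the observed 0.17. The following is a sketch using NumPy only. It assumes a multinomial draw is a fair stand-in for `sample_proportions`, uses 10,000 repetitions instead of 100,000 to keep it quick, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)             # seed chosen only for reproducibility
nature = np.array([0.5, 0.3, 0.2])
collection = np.array([0.67, 0.25, 0.08])
observed_tvd = np.abs(nature - collection).sum() / 2

repetitions = 10_000
# Each row is one simulated sample of 100 flies, expressed as proportions
samples = rng.multinomial(100, nature, size=repetitions) / 100
simulated_tvds = np.abs(samples - nature).sum(axis=1) / 2

# Small tolerance so that floating-point ties count as "at least as large"
empirical_p = np.mean(simulated_tvds >= observed_tvd - 1e-12)
print(empirical_p)
```

A very small proportion here means that random samples of 100 flies almost never differ from the natural distribution by as much as the collection does.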