# Chapter 14.5 - Variability of the Sample Mean

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots

In [None]:
song_lengths = Table.read_table("top_spotify_songs_usa.csv").select("duration_ms")
song_lengths = song_lengths.with_column("seconds", song_lengths.column("duration_ms") / 1000)
song_lengths = song_lengths.drop("duration_ms")
song_lengths.show(5)

In [None]:
print("Mean seconds:", np.mean(song_lengths.column("seconds")))
print("Minimum seconds:", np.min(song_lengths.column("seconds")))
print("Maximum seconds:", np.max(song_lengths.column("seconds")))

In [None]:
Table().with_column("Seconds", song_lengths.column("seconds")).hist(bins=np.arange(30, 521, 20))

In [None]:
def simulate_sample_mean(table, label, sample_size, repetitions):
    """Simulate sample means, plot their histogram, and print summary quantities."""
    means = make_array()
    for _ in range(repetitions):
        # Table.sample draws rows at random, with replacement by default
        new_sample = table.sample(sample_size)
        new_sample_mean = np.mean(new_sample.column(label))
        means = np.append(means, new_sample_mean)
    sample_means = Table().with_column('Sample Means', means)

    # Display empirical histogram and print all relevant quantities
    sample_means.hist(bins=20)
    plots.xlabel('Sample Means')
    plots.title('Sample Size ' + str(sample_size))
    population_std = np.std(table.column(label))
    print("1. Sample size:", sample_size)
    print("2. Population mean:", np.mean(table.column(label)))
    print("3. Average of sample means:", np.mean(means))
    print("4. Population SD:", population_std)
    print("5. SD of sample means:", np.std(means))
    print("6. Population SD / sqrt(sample size):", population_std / np.sqrt(sample_size), "\n")

In [None]:
for sample_size in range(100, 501, 200):
    simulate_sample_mean(song_lengths, "seconds", sample_size, 10000)

What observations can you make from the graphs and printed values above regarding the following?
- The population mean and standard deviation (#2, #4)
- The average of the sample means (#3)
- The SD of the sample means (#5)
- The ratio of the population SD to the square root of the sample size (#6)
- The relationship between #5 and #6

**Statistics Theory** - If the sample size is fixed and the samples are drawn at random with replacement 
from the population, then

 SD of all possible sample means = population SD / sqrt(sample size)

The SD of all possible sample means measures how variable the sample mean can be. *As such, 
it is a measure of the accuracy of the sample mean as an estimate of the population mean*. 
The smaller the SD, the more accurate the estimate.
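
The formula can also be checked without the song data. The following is a minimal sketch using only NumPy; the exponential population, the seed, the sample size, and the number of repetitions are all hypothetical choices made for illustration, not part of the Spotify example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population (NOT the Spotify data)
population = rng.exponential(scale=200, size=50_000)

sample_size = 200
repetitions = 5_000

# Each row is one random sample drawn with replacement from the population
samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
sample_means = samples.mean(axis=1)

empirical_sd = np.std(sample_means)
predicted_sd = np.std(population) / np.sqrt(sample_size)

print("SD of sample means:", empirical_sd)
print("Population SD / sqrt(sample size):", predicted_sd)
```

The two printed values should be close but not identical, because the first comes from a finite number of simulated samples rather than from all possible samples.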

The formula shows that:
- The population size doesn’t affect the accuracy of the sample mean. 
The population size doesn’t appear anywhere in the formula.
- The population SD is a constant; it’s the same for every sample drawn from the population. The sample size can be varied. 
Because the sample size appears in the denominator, the variability of the sample mean decreases as the sample size increases, 
and hence the accuracy increases.

Thus, in general, when you multiply the sample size by a factor, the SD of the sample mean goes down by the square root of that factor, 
so the accuracy of the sample mean goes up by the square root of that factor.
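
As a quick illustration of the square root relationship, the sketch below applies the formula to two sample sizes that differ by a factor of 4. The helper name `sd_of_sample_mean` and the population SD of 40 are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def sd_of_sample_mean(population_sd, sample_size):
    """SD of all possible sample means, from the formula in this section."""
    return population_sd / np.sqrt(sample_size)

population_sd = 40.0  # hypothetical value, for illustration only

sd_100 = sd_of_sample_mean(population_sd, 100)
sd_400 = sd_of_sample_mean(population_sd, 400)
ratio = sd_100 / sd_400

print("SD with sample size 100:", sd_100)
print("SD with sample size 400:", sd_400)
print("Ratio:", ratio)  # 4 times the sample size gives sqrt(4) = 2 times smaller SD
```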

# Chapter 14.6 - Choosing a Sample Size

Suppose we want to sample the population of the common housefly to estimate the proportion of flies
that have a certain proboscis shape. If we want a 95% confidence interval whose total width is
no more than 10% (that is, 0.10), how large a sample do we need?

The **Central Limit Theorem** tells us that when the sample is large, the distribution of the sample mean is roughly normal. 
For a normal distribution, the interval "center +/- 2 SD" contains about 95% of the values.

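The interval "sample mean +/- 2 SD of the sample mean" has a total width of 4 SDs of the sample mean.
Since it is reasonable to assume we are drawing from a population with replacement and the sample size will be fixed,
we can plug in the formula from 14.5 for the SD of the sample mean and require that width to be at most 0.10: 

 (4 * Population SD) / sqrt(sample size) <= 0.10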

**or equivalently:** 

 sample size >= (40 * Population SD) ** 2

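**More Statistics Theory** - When each member of the population is coded as 1 (has the shape) or 0 (does not), so that the mean is a proportion, the Population SD can never be more than 0.5.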
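
The 0.5 bound can be checked numerically. The sketch below assumes a population made up only of 0s and 1s, for which the SD equals sqrt(p * (1 - p)) when a proportion p of the values are 1; that formula is a standard result not stated in this section, and the grid of proportions and the 30/70 population are hypothetical.

```python
import numpy as np

# Proportions of 1s in a hypothetical 0/1 population, from 0 to 1 in steps of 0.01
proportions = np.linspace(0, 1, 101)

# SD of a 0/1 population in which a proportion p of the values are 1
sds = np.sqrt(proportions * (1 - proportions))

largest_sd = sds.max()
p_at_largest = proportions[sds.argmax()]

# Cross-check the formula against np.std on an explicit 0/1 population (70% ones)
zero_one_population = np.array([0] * 30 + [1] * 70)
direct_sd = np.std(zero_one_population)

print("Largest SD:", largest_sd, "at proportion", p_at_largest)
print("SD computed directly for 70% ones:", direct_sd)
```

The largest SD occurs when half the population is 1 and half is 0, which is why 0.5 serves as a safe upper bound when the true proportion is unknown.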

Thus, we can guarantee our desired result if

 sample size >= (40 * 0.5) ** 2
 sample size >= 20 ** 2
 sample size >= 400
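
The same arithmetic can be wrapped in a small helper. This is a sketch only: the function name `min_sample_size` and its parameters are hypothetical, and it assumes the worst-case Population SD of 0.5 and an interval of total width 4 SDs of the sample mean, as above.

```python
import math

def min_sample_size(total_width, population_sd=0.5):
    """Smallest sample size for which 4 * population_sd / sqrt(n) <= total_width.

    Assumes a 95% interval of the form "center +/- 2 SD" (total width 4 SDs)
    and, by default, the worst-case SD of 0.5 for a 0/1 population.
    """
    required = (4 * population_sd / total_width) ** 2
    # Round first to guard against floating-point noise before taking the ceiling
    return math.ceil(round(required, 9))

n_for_ten_percent = min_sample_size(0.10)
print("Sample size for total width 0.10:", n_for_ten_percent)
```

Calling `min_sample_size(0.10)` reproduces the value of 400 derived above.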

**Test Your Understanding** - How large should the sample be if the total width of the 95% confidence interval should be no more than 5%?