# Chapter 14.3 - The SD and the Normal Curve

In [None]:
from datascience import *
import numpy as np
%matplotlib inline

In [None]:
song_lengths = Table.read_table("top_spotify_songs_usa.csv").select("duration_ms")
song_lengths.show(3)

In [None]:
true_mean = np.mean(song_lengths.column("duration_ms"))
true_std = np.std(song_lengths.column("duration_ms"))
print("True Mean = {:.0f}, True STD = {:.0f}".format(true_mean, true_std))

In [None]:
num_samples = 50
repetitions = 10000

average_song_lengths = make_array()

for _ in np.arange(repetitions):
    # Draw one random sample of songs and record its rounded mean duration
    lengths = song_lengths.sample(num_samples)
    new_average_length = int(np.round(np.mean(lengths.column('duration_ms'))))
    average_song_lengths = np.append(average_song_lengths, new_average_length)

results = Table().with_column('Average Song Length', average_song_lengths)
results.show(3)

In [None]:
results.hist("Average Song Length", bins = np.arange(170000, 235001, 500))

**Conceptual Questions**
- How do you expect the mean of the results table to compare with the mean of the song_lengths table?
- How do you expect the std of the results table to compare with the std of the song_lengths table?

In [None]:
mean_of_samples = np.mean(results.column("Average Song Length"))
std_of_samples = np.std(results.column("Average Song Length"))
print("Mean of Samples = {:.0f}, STD of Samples = {:.0f}".format(mean_of_samples, std_of_samples))
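
The comparison above can be checked against a standard result: when samples are drawn with replacement, the sample means center on the population mean, and their SD is roughly the population SD divided by the square root of the sample size. The sketch below illustrates this on a synthetic population rather than the Spotify data. The gamma-distributed `population`, the seed, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in population (NOT the Spotify data): skewed "durations" in ms
population = rng.gamma(shape=9.0, scale=25_000.0, size=10_000)
pop_mean = np.mean(population)
pop_sd = np.std(population)

sample_size = 50
repetitions = 10_000

# Draw every sample at once, with replacement; each row is one sample
samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
sample_means = samples.mean(axis=1)

# Predicted SD of the sample means: population SD / sqrt(sample size)
predicted_sd = pop_sd / np.sqrt(sample_size)

print("Population mean = {:.0f}, mean of sample means = {:.0f}".format(
    pop_mean, np.mean(sample_means)))
print("Predicted SD = {:.0f}, observed SD of sample means = {:.0f}".format(
    predicted_sd, np.std(sample_means)))
```

Drawing all rows in a single `rng.choice` call avoids growing an array inside a loop, which matters little at this scale but keeps the simulation short.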

# Chapter 14.4 - The Central Limit Theorem

**Central Limit Theorem** - the probability distribution of the sum or average of a large random sample drawn with replacement 
will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*.
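
A minimal sketch of the theorem in action, assuming a strongly skewed exponential population (an illustrative choice, not something taken from the data above). Independent draws from a distribution play the role of sampling with replacement. The helper `skewness` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(values):
    """Third standardized moment: 0 for a perfectly symmetric distribution."""
    z = (values - np.mean(values)) / np.std(values)
    return np.mean(z ** 3)

sample_size = 100
repetitions = 10_000

# Each row is one random sample from a right-skewed exponential population
draws = rng.exponential(scale=1.0, size=(repetitions, sample_size))
sample_means = draws.mean(axis=1)

# Standardize the sample means and measure how many fall within 1 SD
z = (sample_means - np.mean(sample_means)) / np.std(sample_means)
within_1_sd = np.mean(np.abs(z) <= 1)

print("Skewness of individual draws: {:.2f}".format(skewness(draws.ravel())))
print("Skewness of sample means:     {:.2f}".format(skewness(sample_means)))
print("Fraction of sample means within 1 SD: {:.3f}".format(within_1_sd))
```

The individual draws are heavily skewed, while the sample means should be far closer to symmetric, with a share within 1 SD near what a normal curve predicts.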

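The standard normal **cdf** (Cumulative Distribution Function) gives the proportion of a standard normal distribution that lies at or below a given value. It is available as `stats.norm.cdf`.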

In [None]:
from scipy import stats

For normal distributions, the amount of data captured within 1 standard deviation of the mean is:

In [None]:
# Reminder: Chebyshev's bound states that it will be at least 0%
print("{:.2f}%".format(100 * (stats.norm.cdf(1) - stats.norm.cdf(-1))))

**Active Learning**: Use Python to:
- Print the lowest value in the results table above that is within 1 STD of the mean.
- Print the highest value in the results table above that is within 1 STD of the mean.


In [None]:
# Place answer here.

For normal distributions, the amount of data captured within 2 standard deviations of the mean is:

In [None]:
# Reminder: Chebyshev's bound states that it will be at least 75%
print("{:.2f}%".format(100 * (stats.norm.cdf(2) - stats.norm.cdf(-2))))

For normal distributions, the amount of data captured within 3 standard deviations of the mean is:

In [None]:
# Reminder: Chebyshev's bound states that it will be at least 88.89%
print("{:.2f}%".format(100 * (stats.norm.cdf(3) - stats.norm.cdf(-3))))
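
The three cells above can be summarized side by side. The sketch below compares Chebyshev's lower bound, 1 - 1/k², with the normal-curve coverage for k = 1, 2, 3. It uses `math.erf` from the standard library, since the area under the standard normal curve between -k and k equals erf(k/√2). The function names are hypothetical helpers, not part of `scipy` or `datascience`.

```python
import math

def chebyshev_lower_bound(k):
    """Chebyshev's bound: at least 1 - 1/k^2 of ANY distribution lies within k SDs."""
    return 1 - 1 / k ** 2

def normal_coverage(k):
    """Area under the standard normal curve between -k and k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print("k = {}: Chebyshev at least {:.2f}%, normal curve {:.2f}%".format(
        k, 100 * chebyshev_lower_bound(k), 100 * normal_coverage(k)))
```

Chebyshev's bound holds for every distribution, so it is much weaker than the figure for the normal curve, which relies on the bell shape.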

In general, for bell-shaped distributions, the SD is the distance between the mean and the 
points of inflection on either side.
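
This property can be checked numerically for a normal curve. The sketch below estimates the second derivative of a normal density on a fine grid and looks for the places where the curvature changes sign. The values of `mu` and `sigma`, the grid size, and the helper `normal_pdf` are assumptions chosen for illustration.

```python
import numpy as np

mu, sigma = 10.0, 2.0  # hypothetical mean and SD, chosen for illustration

def normal_pdf(x, mu, sigma):
    """Density of the normal curve with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))

# Evenly spaced grid covering 4 SDs on either side of the mean
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 8000)
h = x[1] - x[0]
density = normal_pdf(x, mu, sigma)

# Central-difference estimate of the second derivative at the interior grid points
second = (density[2:] - 2 * density[1:-1] + density[:-2]) / h ** 2
interior = x[1:-1]

# An inflection point lies where the curvature changes sign between neighbors
idx = np.where(np.diff(np.sign(second)) != 0)[0]
inflection_points = (interior[idx] + interior[idx + 1]) / 2

print("Estimated inflection points:", np.round(inflection_points, 2))
print("mean - SD and mean + SD:    ", mu - sigma, mu + sigma)
```

The estimates are only as precise as the grid spacing, so they should agree with mean ± SD to within a few thousandths here.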