# Chapter 14 - Why the Mean Matters

In [None]:
from datascience import *
import numpy as np
%matplotlib inline

## 14.1 - Properties of the Mean

**Average** or **Mean** of a collection of numbers: the sum of all the elements of the collection, 
divided by the number of elements in the collection.

In [None]:
heights = make_array(66, 73, 65, 67)
print(np.average(heights))
print(np.mean(heights))

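Proportions are means: if a collection consists only of 1s and 0s (or `True` and `False`), its mean is the proportion of 1s (or `True` values).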

In [None]:
at_least_5_10 = make_array(0, 1, 0, 0)
print(np.mean(at_least_5_10))
at_least_5_10 = make_array(False, True, False, False)
print(np.mean(at_least_5_10))

The mean of a collection depends only on the distinct values and their proportions, not on the number of elements 
in the collection. In other words, the mean of a collection depends only on the distribution of values in the collection.
Therefore, **if two collections have the same distribution, then they have the same mean**.

In [None]:
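heights_2 = make_array(65, 66, 67, 73, 73, 67, 66, 65)
# heights_2 has the same distribution as heights, so the two means agree
print(np.mean(heights))
print(np.mean(heights_2))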
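
The claim that the mean depends only on the distinct values and their proportions can be checked directly. The sketch below uses plain NumPy arrays in place of `make_array`, and the helper name `mean_from_distribution` is introduced here for illustration only.

```python
import numpy as np

heights = np.array([66, 73, 65, 67])
heights_2 = np.array([65, 66, 67, 73, 73, 67, 66, 65])

def mean_from_distribution(values):
    """Compute the mean as a weighted sum: distinct value times its proportion."""
    distinct, counts = np.unique(values, return_counts=True)
    proportions = counts / len(values)
    return np.sum(distinct * proportions)

# Both collections have the same distinct values in the same proportions,
# so the weighted sums are identical even though the lengths differ.
mean_1 = mean_from_distribution(heights)
mean_2 = mean_from_distribution(heights_2)
print(mean_1, mean_2)
```

Each distinct height makes up 25% of both arrays, which is why doubling every entry in the collection leaves the mean unchanged.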

The mean is the center of gravity or balance point of the histogram.

In [None]:
heights_table = Table().with_column("Heights", heights)
heights_table.hist("Heights")

In [None]:
percentile(50, heights)

The mean and the median need not agree. Here the median is 66 but the mean is 67.75, so more than half of the students (75% in this example) are below average in height.
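
As a quick check of the 75% figure, the proportion of entries below the mean can itself be computed as a mean of Boolean values. This is a minimal sketch using a plain NumPy array for the same four heights.

```python
import numpy as np

heights = np.array([66, 73, 65, 67])

mean_height = np.mean(heights)

# Comparing to the mean gives an array of True/False values;
# the mean of that array is the proportion of True values.
proportion_below_mean = np.mean(heights < mean_height)

print(mean_height)
print(proportion_below_mean)
```

The single tall value, 73, pulls the balance point to the right of the other three entries.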

## 14.2 - Variability

In [None]:
deviations = heights - np.mean(heights)
heights_table = heights_table.with_column("Deviation from Mean", deviations)
heights_table

In [None]:
# Note: the sum of the deviations from average is zero,
# so the mean of the deviations is zero as well
np.sum(heights_table.column("Deviation from Mean"))
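
The deviations cancel for every collection, not just this one. A one-line derivation, writing the entries as $x_1, \ldots, x_n$ and their mean as $\bar{x}$ (notation introduced here for illustration):

$$
\sum_{i=1}^{n} (x_i - \bar{x}) \;=\; \sum_{i=1}^{n} x_i - n\bar{x} \;=\; n\bar{x} - n\bar{x} \;=\; 0
$$

This is why the plain average of the deviations is useless as a measure of spread, and why the deviations are squared in the next step.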

In [None]:
squared_deviations = deviations ** 2
heights_table = heights_table.with_column("Squared Deviation from Average", squared_deviations)
heights_table

In [None]:
variance = np.mean(heights_table.column("Squared Deviation from Average"))
variance

In [None]:
standard_deviation = variance ** 0.5
standard_deviation

**Standard Deviation (SD)** of a list: the root mean square of deviations from average. 

In [None]:
np.std(heights)
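
The sketch below checks that `np.std` with its default settings matches the root-mean-square definition used in this chapter. By default `np.std` divides by the number of elements (`ddof=0`); passing `ddof=1` divides by one less and gives a slightly larger value, which is not the SD defined here.

```python
import numpy as np

heights = np.array([66, 73, 65, 67])

# Read the definition backwards: deviations, then square, then mean, then root.
rms_of_deviations = np.sqrt(np.mean((heights - np.mean(heights)) ** 2))

default_sd = np.std(heights)           # divides by n (ddof=0)
ddof_1_sd = np.std(heights, ddof=1)    # divides by n - 1

print(rms_of_deviations, default_sd, ddof_1_sd)
```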

**Chebychev’s Bounds**: For all lists, and all numbers *z*, the proportion of entries that are in the range 
“average ± *z* SDs” is at least *1 - 1/z²*.
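
To get a feel for the bound, it helps to evaluate *1 - 1/z²* for a few values of *z*. The function name `chebychev_lower_bound` is introduced here for illustration.

```python
def chebychev_lower_bound(z):
    """Return the guaranteed minimum proportion within z SDs of the average."""
    return 1 - 1 / z ** 2

for z in (2, 3, 4, 5):
    print("Within {} SDs: at least {:.2f}%".format(z, 100 * chebychev_lower_bound(z)))
```

For *z* at or below 1 the expression is zero or negative, so the bound says nothing useful there; it becomes informative only for *z* greater than 1.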

Let's examine the variability of song lengths in our Spotify CSV file. The `duration_ms` column gives each song's length in milliseconds.

In [None]:
song_lengths = Table.read_table("top_spotify_songs_usa.csv").select("duration_ms")
song_lengths.show(5)

To convert a value to **standard units**, compute (value - average) / SD. The result measures how many SDs the value lies above (positive) or below (negative) the average.

In [None]:
def standard_units(numbers_array):
    """Convert any array of numbers to standard units."""
    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)
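
A useful sanity check on any conversion to standard units is that the converted values have mean 0 and SD 1. The sketch below repeats the function so it runs on its own, and applies it to the four heights from Section 14.1 as a plain NumPy array.

```python
import numpy as np

def standard_units(numbers_array):
    """Convert any array of numbers to standard units."""
    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)

heights = np.array([66, 73, 65, 67])
heights_su = standard_units(heights)

print(heights_su)
print(np.mean(heights_su))  # approximately 0, up to floating-point rounding
print(np.std(heights_su))   # approximately 1
```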

In [None]:
song_lengths = song_lengths.with_column("Length (Standard Units)", standard_units(song_lengths.column("duration_ms")))
song_lengths.show(5)

In [None]:
song_lengths.sort("duration_ms", descending=True)

In [None]:
song_lengths.sort("duration_ms")

In [None]:
within_3_sd = song_lengths.where('Length (Standard Units)', are.between(-3, 3))
print("Songs within 3 standard deviations of mean: {:.2f}%".format(100 * within_3_sd.num_rows/song_lengths.num_rows))

In [None]:
print("Chebychev's bound guarantees this number is at least {:.2f}%".format((1 - 1/3**2) * 100))
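
The same comparison can be repeated for several values of *z*. The sketch below does not use the Spotify file; it runs on a synthetic, deliberately skewed array generated with NumPy, so the printed proportions are illustrative only and say nothing about the real song data.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, NOT the Spotify data):
# 1,000 skewed "durations" drawn from an exponential distribution.
rng = np.random.default_rng(0)
synthetic_lengths = rng.exponential(scale=200_000, size=1_000)

su = (synthetic_lengths - np.mean(synthetic_lengths)) / np.std(synthetic_lengths)

results = {}
for z in (2, 3, 4):
    observed = np.mean(np.abs(su) <= z)   # proportion within z SDs of the mean
    bound = 1 - 1 / z ** 2                # Chebychev's guaranteed minimum
    results[z] = (observed, bound)
    print("z = {}: observed {:.2f}%, bound {:.2f}%".format(z, 100 * observed, 100 * bound))
```

The observed proportion can never fall below the bound, whatever the shape of the distribution; for most data sets it sits well above it.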

In [None]:
song_lengths.hist('Length (Standard Units)', bins=np.arange(-4, 7, 0.5))