# Chapter 15 - Prediction

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots

In [None]:
# Generate noisy data scattered around the line y = -3x + 250
def generate_table(number_items):
    result = Table(make_array("x", "y"))
    for _ in range(number_items):
        x = np.random.random() * 100           # x is uniform on [0, 100)
        delta = 50 * np.random.random() - 25   # noise is uniform on [-25, 25)
        y = -3 * (x + delta) + 250
        result = result.with_row(make_array(x, y))
    return result

In [None]:
data = generate_table(100)
data.scatter("x")
plots.plot([0, 100], [250, -50], color='red', lw=2);

## 15.1 - Correlation

**Linear Association** - how tightly clustered a scatter diagram is about a straight line.
Do x and y in the graph above display a *positive* or *negative* association?

In [None]:
# Recall from Chapter 14 ...
def standard_units(numbers):
    "Convert any array of numbers to standard units."
    return (numbers - np.mean(numbers)) / np.std(numbers)

In [None]:
standardized_data = Table().with_columns(
    'x (standard units)', standard_units(data.column('x')),
    'y (standard units)', standard_units(data.column('y'))
)
standardized_data.scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);
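
As a quick sanity check on *standard_units*, a converted array should have mean 0 and standard deviation 1. The sketch below uses plain NumPy and a made-up array (`values` is a hypothetical stand-in, not data from this notebook).

```python
import numpy as np

def standard_units(numbers):
    """Convert any array of numbers to standard units."""
    return (numbers - np.mean(numbers)) / np.std(numbers)

# Hypothetical example values, chosen only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
su = standard_units(values)

print("mean in standard units:", np.mean(su))
print("SD in standard units:  ", np.std(su))
```

This is why the standardized scatter plot is centered at the origin and most points fall between -3 and 3 on both axes.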

*Correlation Coefficient* or **r** - measures the strength of the linear relationship between two variables.
- The correlation coefficient is a number between -1 and 1.
- **r** measures the extent to which the scatter plot clusters around a straight line.
- If **r** == 1, the scatter diagram is a perfect straight line sloping upwards, and if **r** == -1 it is a perfect straight line sloping downwards. When both variables are plotted in standard units, these lines are at 45 degrees.
- **r** is the average of the products of the two variables, when both variables are measured in standard units.
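
The last bullet can be written as a formula. The symbols are introduced here only for illustration: $n$ is the number of points, $\bar{x}$ and $\bar{y}$ are the means, and $\sigma_x$ and $\sigma_y$ are the standard deviations as computed by `np.std`, which divides by $n$.

$$
r = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)
$$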

In [None]:
r = np.mean(standardized_data.column("x (standard units)") * standardized_data.column("y (standard units)"))
print("r =", r)

What would r be if *delta* in *generate_table* above were always 0?

In [None]:
# Let's make a function to calculate r from non-standardized data
def correlation(table, label_1, label_2):
    return np.mean(standard_units(table.column(label_1)) * standard_units(table.column(label_2)))

In [None]:
r = correlation(data, "x", "y")
print("r =", r)
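
One way to cross-check this calculation is against NumPy's built-in `np.corrcoef`. The sketch below is self-contained and does not use the `datascience` library: the arrays `x` and `y` are a stand-in generated with the same recipe as *generate_table*, and the seeded generator `rng` is an assumption made here so the result is repeatable.

```python
import numpy as np

def standard_units(numbers):
    """Convert any array of numbers to standard units."""
    return (numbers - np.mean(numbers)) / np.std(numbers)

# Stand-in data following the same recipe as generate_table (assumed seed)
rng = np.random.default_rng(0)
x = rng.random(100) * 100
delta = 50 * rng.random(100) - 25
y = -3 * (x + delta) + 250

# Mean of products in standard units, as in correlation() above
r_manual = np.mean(standard_units(x) * standard_units(y))

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print("r (mean of products) =", r_manual)
print("r (np.corrcoef)      =", r_numpy)
```

The two values should agree up to floating-point rounding, and r should be negative because y decreases as x increases.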

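A few caveats:
- Correlation measures association, not causation
- Correlation only measures linear association; a strong nonlinear relationship can still have **r** near 0
- Outliers can significantly impact correlation; a single extreme point can change **r** substantially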
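
The last two caveats can be illustrated with small made-up examples. In this sketch all of the data is hypothetical: a parabola shows a perfect but nonlinear relationship, and a single point added to a perfect line shows the effect of an outlier.

```python
import numpy as np

def standard_units(numbers):
    """Convert any array of numbers to standard units."""
    return (numbers - np.mean(numbers)) / np.std(numbers)

def correlation_arrays(x, y):
    """Correlation coefficient of two arrays (hypothetical helper)."""
    return np.mean(standard_units(x) * standard_units(y))

# Nonlinear association: y is completely determined by x, yet r is about 0
x_curve = np.linspace(-1, 1, 101)
y_curve = x_curve ** 2
r_curve = correlation_arrays(x_curve, y_curve)

# Outlier: ten points exactly on a line, then one extreme point added
x_line = np.arange(10.0)
y_line = 2 * x_line + 1
r_line = correlation_arrays(x_line, y_line)

x_out = np.append(x_line, 20.0)
y_out = np.append(y_line, -30.0)
r_outlier = correlation_arrays(x_out, y_out)

print("r for y = x^2:          ", r_curve)
print("r for a perfect line:   ", r_line)
print("r after adding outlier: ", r_outlier)
```

In both cases it helps to look at the scatter plot before interpreting r.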

## 15.2 - The Regression Line

In [None]:
# Calculate the slope of the regression line, y = mx + b
m = r * (np.std(data.column("y")) / np.std(data.column("x")))
print("m =", m)

In [None]:
# Calculate the intercept of the regression line, y = mx + b
b = np.mean(data.column("y")) - m*np.mean(data.column("x"))
print("b =", b)
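
The slope and intercept computed this way are the same as those of the least-squares line, so they can be cross-checked with `np.polyfit`. The sketch below is self-contained: `x` and `y` are stand-in arrays built with the same recipe as *generate_table*, and the seeded generator is an assumption made for repeatability.

```python
import numpy as np

# Stand-in data following the same recipe as generate_table (assumed seed)
rng = np.random.default_rng(1)
x = rng.random(100) * 100
delta = 50 * rng.random(100) - 25
y = -3 * (x + delta) + 250

# Slope and intercept from r and the standard deviations, as above
r = np.mean(((x - np.mean(x)) / np.std(x)) * ((y - np.mean(y)) / np.std(y)))
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

# Degree-1 least-squares fit; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, 1)

print("m =", m, " polyfit slope =", slope)
print("b =", b, " polyfit intercept =", intercept)
```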

Thus, the equation of the regression line is

    y = mx + b (plug in the m and b values computed above)

Because the noise added in *generate_table* averages to 0 and does not depend on x, this should be close, though not identical, to the equation that was used to generate the random data:

    y = -3x + 250
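
To get a feel for how close "close" is, one option is to repeat the experiment many times and look at how the fitted slope and intercept vary. The sketch below is a vectorized NumPy stand-in for *generate_table*, not the notebook's own code. The helper `fit_once`, the seed, and the choice of 200 repetitions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

def fit_once(number_items=100):
    """Generate one noisy sample and return its regression slope and intercept."""
    x = rng.random(number_items) * 100
    delta = 50 * rng.random(number_items) - 25
    y = -3 * (x + delta) + 250
    r = np.mean(((x - np.mean(x)) / np.std(x)) * ((y - np.mean(y)) / np.std(y)))
    m = r * np.std(y) / np.std(x)
    b = np.mean(y) - m * np.mean(x)
    return m, b

# Repeat the whole experiment and collect the fitted lines
fits = np.array([fit_once() for _ in range(200)])
slopes, intercepts = fits[:, 0], fits[:, 1]

print("average slope:    ", np.mean(slopes))
print("average intercept:", np.mean(intercepts))
print("SD of slopes:     ", np.std(slopes))
print("SD of intercepts: ", np.std(intercepts))
```

The averages should land near -3 and 250, while the standard deviations show how much any single sample of 100 points can be expected to miss by.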