Chapter 6: Bayesian Learning
- Bayesian learning is a probabilistic approach to inference.
- The naive Bayes classifier is a practical classification alternative.
- Bayes' Theorem provides a useful perspective for understanding other
machine learning approaches.
Advantages
- Handles incremental information
- Prior knowledge can be used
- Probabilistic predictions are possible
- Easy to combine multiple hypotheses
- Good benchmark for other approaches
Drawbacks
- The amount of required initial knowledge might be very large
- There is a potentially high computational cost
Bayes' Theorem
- P(h): prior probability of hypothesis h
- P(D): prior probability of observing training data D
- P(D|h): probability of observing D given that h holds
- P(h|D): posterior probability of h, given D
- Bayes' Theorem. P(h|D) = P(D|h) * P(h) / P(D)
- MAP hypothesis: the maximum a posteriori hypothesis, i.e., the
maximally probable hypothesis given the data
- hMAP = argmax P(h|D)
= argmax P(D|h) * P(h) / P(D)
= argmax P(D|h) * P(h) (since P(D) is a constant)
- If all h are equally likely a priori, hMAP reduces to
hML = argmax P(D|h), where ML stands for maximum likelihood
- Take a look at the example on page 158
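A worked numeric sketch of Bayes' Theorem and hMAP in Python (the
probabilities below are invented for illustration; they are not the
book's example):

    # Two competing hypotheses: the patient is sick, or is healthy.
    # All numbers here are illustrative assumptions.
    p_sick = 0.008               # prior P(h = sick)
    p_pos_given_sick = 0.98      # likelihood P(D = positive | sick)
    p_pos_given_healthy = 0.03   # likelihood P(D = positive | healthy)

    # Unnormalized posteriors P(D|h) * P(h) for each hypothesis
    score_sick = p_pos_given_sick * p_sick
    score_healthy = p_pos_given_healthy * (1 - p_sick)

    # P(D) normalizes the scores (total probability rule)
    p_pos = score_sick + score_healthy
    print("P(sick | +)    =", score_sick / p_pos)     # about 0.21
    print("P(healthy | +) =", score_healthy / p_pos)  # about 0.79
    # hMAP is "healthy" even after a positive test: the prior dominates.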
Other Rules
- Product Rule. P(A ∧ B) = P(A|B) * P(B) = P(B|A) * P(A)
- Sum Rule. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
- Total Probability Rule.
P(B) = Σ P(B|Ai) * P(Ai)
where the events Ai are mutually exclusive and their probabilities
sum to 1
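A quick numeric sanity check of these rules in Python (the joint
distribution below is an invented toy example):

    # Toy joint distribution over two binary events A and B.
    p = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}

    p_A = p[(True, True)] + p[(True, False)]   # P(A) = 0.5
    p_B = p[(True, True)] + p[(False, True)]   # P(B) = 0.3
    p_A_and_B = p[(True, True)]                # P(A ∧ B) = 0.2

    # Product rule: P(A ∧ B) = P(A|B) * P(B)
    assert abs(p_A_and_B - (p_A_and_B / p_B) * p_B) < 1e-12
    # Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
    p_A_or_B = p_A + p_B - p_A_and_B           # 0.6
    # Total probability rule, with A and not-A as the partition:
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_B_given_A = p[(True, True)] / p_A            # 0.4
    p_B_given_notA = p[(False, True)] / (1 - p_A)  # 0.2
    assert abs(p_B - (p_B_given_A * p_A + p_B_given_notA * (1 - p_A))) < 1e-12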
Brute Force MAP Learning Algorithm
- Calculate P(h|D) = P(D|h) * P(h) / P(D) for every h in H
- Output hMAP = argmax P(h|D)
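A minimal Python sketch of the algorithm, assuming a finite hypothesis
space and user-supplied prior and likelihood functions (the names here
are hypothetical):

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return the MAP hypothesis by scoring every h in H.

        prior(h) returns P(h); likelihood(data, h) returns P(D|h).
        P(D) is omitted: it is the same constant for every h, so it
        does not change the argmax.
        """
        return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))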
One Set of Assumptions
- D is noise free
- The target concept c is in H
- All h are equally likely
Using these assumptions, we can reason as follows:
- P(h) = 1 / |H|
- P(D|h) = 1 if h is consistent with D and 0 otherwise
So when h is consistent with D,
- P(h|D) = 1 * (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D, the version space, is the subset of hypotheses from H
consistent with D
- Here P(D) = |VSH,D| / |H| by the total probability rule: each
consistent h contributes P(D|h) * P(h) = 1/|H| and every other h
contributes 0
- When h is inconsistent with D, P(h|D) = 0
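A small sketch of this noise-free, equal-prior case in Python, using an
invented hypothesis space of threshold classifiers:

    # Hypotheses: h_t(x) = (x >= t) for thresholds t = 0..10 (invented).
    hypotheses = [lambda x, t=t: x >= t for t in range(11)]

    # Noise-free training data for the target concept x >= 4.
    data = [(2, False), (5, True), (7, True)]

    # Version space: hypotheses consistent with every training example.
    vs = [h for h in hypotheses if all(h(x) == y for x, y in data)]

    # With equal priors and P(D|h) in {0, 1}, each consistent hypothesis
    # has posterior 1/|VS|; every other hypothesis has posterior 0.
    print(len(vs), 1 / len(vs))  # t in {3, 4, 5} survives: prints 3, 1/3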
Find-S Relevance
- The maximally specific member of the version space is a MAP
hypothesis if either (1) all h are equally likely or (2) more specific
hypotheses are at least as probable as more general ones.
Backpropagation Relevance
- Any learning algorithm that
minimizes the squared error between its output hypothesis's
predictions and the training data will output an ML hypothesis,
provided that (1) the training examples are generated by adding random
noise to the true target value (but not to the other attributes)
and (2) the noise is drawn independently for each example from
a normal distribution with zero mean.
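A minimal numpy sketch of this equivalence, with invented data: under
zero-mean Gaussian noise, the hypothesis maximizing the Gaussian
log-likelihood is the same one minimizing squared error, because the
log-likelihood differs from the negative squared error only by a
positive scale factor and an additive constant.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    d = 3.0 * x + rng.normal(0.0, 0.2, size=x.shape)  # target + noise

    # Candidate linear hypotheses h_w(x) = w * x, on a grid of w values.
    ws = np.linspace(0.0, 6.0, 601)
    sq_err = np.array([np.sum((d - w * x) ** 2) for w in ws])

    # Gaussian log-likelihood with known sigma; constants dropped.
    sigma = 0.2
    log_lik = -sq_err / (2.0 * sigma ** 2)

    # Both criteria select the same hypothesis.
    print(ws[np.argmin(sq_err)], ws[np.argmax(log_lik)])  # identical w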
Practice Exercises