Chapter 6: Bayesian Learning
- Bayesian learning is a probabilistic approach to inference.
- The naive Bayes classifier is a practical classification alternative.
- Bayes' Theorem provides a useful perspective for understanding other
machine learning approaches.
Advantages
- Handles incremental information
- Prior knowledge can be used
- Probabilistic predictions are possible
- Easy to combine multiple hypotheses
- Good benchmark for other approaches
Drawbacks
- The amount of required initial knowledge might be very large
- There is a potentially high computational cost
Bayes' Theorem
- P(h): prior probability of hypothesis h
- P(D): prior probability of observing training data D
- P(D|h): probability of observing D given that h holds
- P(h|D): posterior probability of h, given D
- Bayes' Theorem. P(h|D) = P(D|h) * P(h) / P(D)
- MAP hypothesis: the maximum a posteriori hypothesis, i.e., the
maximally probable hypothesis given the data
- hMAP = argmax P(h|D)
= argmax P(D|h) * P(h) / P(D)
= argmax P(D|h) * P(h) (since P(D) is a constant)
- If all h are equally likely a priori, hMAP reduces to
hML = argmax P(D|h), where ML stands for maximum likelihood
- Take a look at the example on page 158
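A worked numeric sketch of Bayes' Theorem and hMAP in Python (the
probabilities below are invented for illustration; they are not the
book's example):

    # Two competing hypotheses: the patient is sick, or is healthy.
    # All numbers here are illustrative assumptions.
    p_sick = 0.008               # prior P(h = sick)
    p_pos_given_sick = 0.98      # likelihood P(D = positive | sick)
    p_pos_given_healthy = 0.03   # likelihood P(D = positive | healthy)

    # Unnormalized posteriors P(D|h) * P(h) for each hypothesis
    score_sick = p_pos_given_sick * p_sick
    score_healthy = p_pos_given_healthy * (1 - p_sick)

    # P(D) normalizes the scores (total probability rule)
    p_pos = score_sick + score_healthy
    print("P(sick | +)    =", score_sick / p_pos)     # about 0.21
    print("P(healthy | +) =", score_healthy / p_pos)  # about 0.79
    # hMAP is "healthy" even after a positive test: the prior dominates.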
Other Rules
- Product Rule. P(A ∧ B) = P(A|B) * P(B) = P(B|A) * P(A)
- Sum Rule. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
- Total Probability Rule.
P(B) = Σ P(B|Ai) * P(Ai)
where the events Ai are mutually exclusive and their probabilities
sum to 1
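A quick numeric sanity check of these rules in Python (the joint
distribution below is an invented toy example):

    # Toy joint distribution over two binary events A and B.
    p = {(True, True): 0.2, (True, False): 0.3,
         (False, True): 0.1, (False, False): 0.4}

    p_A = p[(True, True)] + p[(True, False)]   # P(A) = 0.5
    p_B = p[(True, True)] + p[(False, True)]   # P(B) = 0.3
    p_A_and_B = p[(True, True)]                # P(A ∧ B) = 0.2

    # Product rule: P(A ∧ B) = P(A|B) * P(B)
    assert abs(p_A_and_B - (p_A_and_B / p_B) * p_B) < 1e-12
    # Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
    p_A_or_B = p_A + p_B - p_A_and_B           # 0.6
    # Total probability rule, with A and not-A as the partition:
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_B_given_A = p[(True, True)] / p_A            # 0.4
    p_B_given_notA = p[(False, True)] / (1 - p_A)  # 0.2
    assert abs(p_B - (p_B_given_A * p_A + p_B_given_notA * (1 - p_A))) < 1e-12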
Brute Force MAP Learning Algorithm
- Calculate P(h|D) = P(D|h) * P(h) / P(D) for every h in H
- Output hMAP = argmax P(h|D)
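A minimal Python sketch of the algorithm, assuming a finite hypothesis
space and user-supplied prior and likelihood functions (the names here
are hypothetical):

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return the MAP hypothesis by scoring every h in H.

        prior(h) returns P(h); likelihood(data, h) returns P(D|h).
        P(D) is omitted: it is the same constant for every h, so it
        does not change the argmax.
        """
        return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))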
One Set of Assumptions
- D is noise free
- The target concept c is in H
- All h are equally likely
Using these assumptions, we can reason as follows:
- P(h) = 1 / |H|
- P(D|h) = 1 if h is consistent with D and 0 otherwise
So when h is consistent with D,
- P(h|D) = 1 * (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D, the version space, is the subset of hypotheses from H
consistent with D
- Here P(D) = |VSH,D| / |H| by the total probability rule: each
consistent h contributes P(D|h) * P(h) = 1/|H| and every other h
contributes 0
- When h is inconsistent with D, P(h|D) = 0
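A small sketch of this noise-free, equal-prior case in Python, using an
invented hypothesis space of threshold classifiers:

    # Hypotheses: h_t(x) = (x >= t) for thresholds t = 0..10 (invented).
    hypotheses = [lambda x, t=t: x >= t for t in range(11)]

    # Noise-free training data for the target concept x >= 4.
    data = [(2, False), (5, True), (7, True)]

    # Version space: hypotheses consistent with every training example.
    vs = [h for h in hypotheses if all(h(x) == y for x, y in data)]

    # With equal priors and P(D|h) in {0, 1}, each consistent hypothesis
    # has posterior 1/|VS|; every other hypothesis has posterior 0.
    print(len(vs), 1 / len(vs))  # t in {3, 4, 5} survives: prints 3, 1/3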
Find-S Relevance
- The maximally specific member of the version space is a MAP
hypothesis if either (1) all h are equally likely or (2) more specific
hypotheses are at least as probable as more general ones.
Backpropagation Relevance
- Any learning algorithm that
minimizes the squared error between its output hypothesis's
predictions and the training data will output an ML hypothesis,
provided that (1) the training examples are generated by adding random
noise to the true target value (but not to the other attributes)
and (2) the noise is drawn independently for each example from
a normal distribution with zero mean.
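A minimal numpy sketch of this equivalence, with invented data: under
zero-mean Gaussian noise, the hypothesis maximizing the Gaussian
log-likelihood is the same one minimizing squared error, because the
log-likelihood differs from the negative squared error only by a
positive scale factor and an additive constant.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    d = 3.0 * x + rng.normal(0.0, 0.2, size=x.shape)  # target + noise

    # Candidate linear hypotheses h_w(x) = w * x, on a grid of w values.
    ws = np.linspace(0.0, 6.0, 601)
    sq_err = np.array([np.sum((d - w * x) ** 2) for w in ws])

    # Gaussian log-likelihood with known sigma; constants dropped.
    sigma = 0.2
    log_lik = -sq_err / (2.0 * sigma ** 2)

    # Both criteria select the same hypothesis.
    print(ws[np.argmin(sq_err)], ws[np.argmax(log_lik)])  # identical w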
Practice Exercises