Chapter 6: Bayesian Learning
Minimum Description Length Principle
- Recall that h_MAP = argmax_h P(D|h) P(h)
- Equivalently, h_MAP = argmax_h (log2 P(D|h) + log2 P(h))
- Or, h_MAP = argmin_h (-log2 P(D|h) - log2 P(h))
- Let L_C(i) denote the number of bits required to encode message i using code C
- Then L_CH(h) = -log2 P(h) under the optimal encoding C_H for the hypothesis space
- And L_CD|h(D|h) = -log2 P(D|h) under the optimal encoding C_D|h for the data given h
- So, h_MAP = argmin_h (L_CH(h) + L_CD|h(D|h))
- The minimum description length principle is to choose h_MDL to minimize this sum of
description lengths
- This principle trades off hypothesis complexity against the number of errors
committed by the hypothesis
- The practical upshot is that if we can come up with such a minimal encoding, the
hypothesis it selects will be a MAP hypothesis. However, determining whether a given
encoding is actually optimal is very difficult, because it requires knowing many prior
probabilities (a small numeric sketch of the argmin computation follows this list)
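- A minimal Python sketch of the argmin above. The three hypotheses and their prior and
likelihood values are invented purely for illustration; any probabilities would do.

    import math

    # Hypothetical priors P(h) and likelihoods P(D|h); the numbers are made up.
    hypotheses = {
        "h1": {"prior": 0.30, "likelihood": 0.020},
        "h2": {"prior": 0.05, "likelihood": 0.150},
        "h3": {"prior": 0.65, "likelihood": 0.001},
    }

    def description_length(p):
        """Bits needed to encode an event of probability p under an optimal code."""
        return -math.log2(p)

    # h_MDL = argmin_h [ L_CH(h) + L_CD|h(D|h) ] = argmin_h [ -log2 P(h) - log2 P(D|h) ]
    h_mdl = min(
        hypotheses,
        key=lambda h: description_length(hypotheses[h]["prior"])
        + description_length(hypotheses[h]["likelihood"]),
    )
    print(h_mdl)  # the same hypothesis that argmax_h P(D|h) P(h) would select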
Bayes Optimal Classifier
- The Bayes optimal classification is
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
  where V is the set of possible classifications and the sum runs over all hypotheses hi in H
- See page 175 for a simple example
- It is expensive because it requires computing the posterior probability of every
hypothesis in H and then combining each hypothesis's prediction (a toy version in code
follows this list)
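- A toy version in code, assuming we already have the posterior P(hi|D) for each
hypothesis and each hypothesis's prediction P(vj|hi). The numbers are hypothetical,
chosen in the spirit of the textbook's example (one hypothesis favoring "+", two
favoring "-").

    # Hypothetical posteriors P(h_i | D) and per-hypothesis predictions P(v_j | h_i).
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    predictions = {
        "h1": {"+": 1.0, "-": 0.0},
        "h2": {"+": 0.0, "-": 1.0},
        "h3": {"+": 0.0, "-": 1.0},
    }

    def bayes_optimal(classes, posteriors, predictions):
        # argmax over v_j of  sum over h_i of  P(v_j | h_i) * P(h_i | D)
        return max(
            classes,
            key=lambda v: sum(predictions[h][v] * posteriors[h] for h in posteriors),
        )

    print(bayes_optimal(["+", "-"], posteriors, predictions))  # "-" wins, 0.6 vs 0.4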
Gibbs Algorithm
- Choose a hypothesis h from H at random, according to
the posterior probability distribution over H
- Use h to predict the classification of the next instance x
- Under appropriate conditions (target concepts drawn at random according to the prior
assumed by the learner), the expected misclassification error of the Gibbs algorithm is
at most twice that of the Bayes optimal classifier; a minimal sketch of the sampling
step follows
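- A minimal sketch of the sampling step, reusing the hypothetical posteriors from the
previous sketch; each hypothesis here classifies every instance the same way, purely for
illustration.

    import random

    # Hypothetical posterior P(h | D) and per-hypothesis classifiers.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    classify_with = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}

    def gibbs_classify(x):
        names = list(posteriors)
        # Draw one hypothesis according to the posterior, then let it classify x.
        h = random.choices(names, weights=[posteriors[n] for n in names], k=1)[0]
        return classify_with[h](x)

    print(gibbs_classify("some new instance"))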
Naive Bayes Classifier
- Examples must be described by conjunctions of attribute values
- The target values come from a finite set V
- Assume that all attribute values are conditionally independent,
given the target value
- v_MAP = argmax_{vj ∈ V} P(vj | a1, ..., an)
-       = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj)
- Under the conditional independence assumption, this becomes
  v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)
- Look at the example on pages 178-179
- Often, the conditional probabilities are estimated with the m-estimate
  (nc + m*p) / (n + m)
  where n is the number of training examples with the given target value, nc is the
  number of those examples that also have the given attribute value, p is the prior
  estimate of the probability (typically 1/k, where k is the number of possible values
  the attribute can take), and m is the equivalent sample size, i.e. the number of
  "virtual" samples added to the observations (see the sketch after this list)
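- A self-contained sketch of a naive Bayes learner over discrete attributes that uses
the m-estimate above for the conditional probabilities. The tiny weather-style data set,
the attribute layout, and the choice m = 2 are all invented for illustration.

    from collections import Counter, defaultdict

    def train_naive_bayes(examples, m=1.0):
        """examples: list of (attribute_tuple, target_value) pairs."""
        class_counts = Counter(v for _, v in examples)
        value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
        domains = defaultdict(set)            # attribute index -> observed values
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                value_counts[(i, v)][a] += 1
                domains[i].add(a)

        def p_class(v):
            return class_counts[v] / len(examples)

        def p_attr(i, a, v):
            n = class_counts[v]                 # examples with target value v
            n_c = value_counts[(i, v)][a]       # of those, examples with value a for attribute i
            p = 1.0 / len(domains[i])           # uniform prior estimate p = 1/k
            return (n_c + m * p) / (n + m)      # the m-estimate

        def classify(attrs):
            # v_NB = argmax_v P(v) * prod_i P(a_i | v)
            def score(v):
                s = p_class(v)
                for i, a in enumerate(attrs):
                    s *= p_attr(i, a, v)
                return s
            return max(class_counts, key=score)

        return classify

    # Usage with an invented toy data set: (Outlook, Wind) -> PlayTennis
    data = [(("sunny", "weak"), "no"), (("rain", "weak"), "yes"),
            (("overcast", "strong"), "yes"), (("sunny", "strong"), "no")]
    classify = train_naive_bayes(data, m=2.0)
    print(classify(("rain", "strong")))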
Naive Bayes Classifier Application: Learning to Classify Text
- Take a look at the algorithm in Table 6.2
- There were 20 possible newsgroups
- 1000 articles were collected from each newsgroup
- Two-thirds of the articles were used for training
- The 100 most commonly occurring words were removed from
the articles
- Words occurring fewer than three times were also removed
- The resulting vocabulary consisted of roughly 38,500 words
- The resulting classifier achieved 89% classification accuracy on the test data
(a compact sketch of the Table 6.2 procedure follows)
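- For concreteness, a compact sketch in the spirit of the Table 6.2 algorithm: estimate
P(vj) and P(wk | vj) = (nk + 1) / (n + |Vocabulary|) from the training documents, then
classify a new document by summing log probabilities over its word positions. The
two-document "corpus" and the class names below are placeholders, not the newsgroup data.

    import math
    from collections import Counter

    def learn_naive_bayes_text(documents):
        """documents: list of (list_of_words, class_label) pairs."""
        vocabulary = {w for words, _ in documents for w in words}
        classes = {label for _, label in documents}
        priors, cond = {}, {}
        for v in classes:
            docs_v = [words for words, label in documents if label == v]
            priors[v] = len(docs_v) / len(documents)
            text_v = [w for words in docs_v for w in words]   # all word positions for class v
            counts = Counter(text_v)
            n = len(text_v)
            # P(w | v) = (n_k + 1) / (n + |Vocabulary|)
            cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return vocabulary, priors, cond

    def classify_naive_bayes_text(doc_words, vocabulary, priors, cond):
        positions = [w for w in doc_words if w in vocabulary]  # ignore out-of-vocabulary words
        return max(priors, key=lambda v: math.log(priors[v])
                   + sum(math.log(cond[v][w]) for w in positions))

    # Placeholder "corpus" with two classes; the real experiment used 20 newsgroups.
    docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda", "notes"], "ham")]
    vocab, priors, cond = learn_naive_bayes_text(docs)
    print(classify_naive_bayes_text(["buy", "cheap", "pills"], vocab, priors, cond))  # "spam"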