Chapter 6: Bayesian Learning
Minimum Description Length Principle
- Recall that h_MAP = argmax_h P(D|h) P(h)
- Equivalently, h_MAP = argmax_h (log2 P(D|h) + log2 P(h))
- Or, h_MAP = argmin_h (-log2 P(D|h) - log2 P(h))
- Let L_C(i) denote the number of bits required to encode message i using code C
- Then L_CH(h) = -log2 P(h) under the optimal encoding C_H for the hypothesis space
- And L_CD|h(D|h) = -log2 P(D|h) under the optimal encoding C_D|h for the data given h
- So, h_MAP = argmin_h (L_CH(h) + L_CD|h(D|h))
- The minimum description length principle is to choose h_MDL to minimize this sum of
description lengths
- This principle trades off hypothesis complexity against the number of errors
committed by the hypothesis
- The practical upshot is that if we can come up with such a minimal encoding, the
hypothesis it selects will be a MAP hypothesis. However, determining whether a given
encoding is actually optimal is very difficult, because it requires knowing many prior
probabilities (a small numeric sketch of the argmin computation follows this list)
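- A minimal Python sketch of the argmin above. The three hypotheses and their prior and
likelihood values are invented purely for illustration; any probabilities would do.

    import math

    # Hypothetical priors P(h) and likelihoods P(D|h); the numbers are made up.
    hypotheses = {
        "h1": {"prior": 0.30, "likelihood": 0.020},
        "h2": {"prior": 0.05, "likelihood": 0.150},
        "h3": {"prior": 0.65, "likelihood": 0.001},
    }

    def description_length(p):
        """Bits needed to encode an event of probability p under an optimal code."""
        return -math.log2(p)

    # h_MDL = argmin_h [ L_CH(h) + L_CD|h(D|h) ] = argmin_h [ -log2 P(h) - log2 P(D|h) ]
    h_mdl = min(
        hypotheses,
        key=lambda h: description_length(hypotheses[h]["prior"])
        + description_length(hypotheses[h]["likelihood"]),
    )
    print(h_mdl)  # the same hypothesis that argmax_h P(D|h) P(h) would select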
Bayes Optimal Classifier
- The Bayes optimal classification is
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
  where V is the set of possible classifications and the sum runs over all hypotheses hi in H
- See page 175 for a simple example
- It is expensive because it requires computing the posterior probability of every
hypothesis in H and then combining each hypothesis's prediction (a toy version in code
follows this list)
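- A toy version in code, assuming we already have the posterior P(hi|D) for each
hypothesis and each hypothesis's prediction P(vj|hi). The numbers are hypothetical,
chosen in the spirit of the textbook's example (one hypothesis favoring "+", two
favoring "-").

    # Hypothetical posteriors P(h_i | D) and per-hypothesis predictions P(v_j | h_i).
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    predictions = {
        "h1": {"+": 1.0, "-": 0.0},
        "h2": {"+": 0.0, "-": 1.0},
        "h3": {"+": 0.0, "-": 1.0},
    }

    def bayes_optimal(classes, posteriors, predictions):
        # argmax over v_j of  sum over h_i of  P(v_j | h_i) * P(h_i | D)
        return max(
            classes,
            key=lambda v: sum(predictions[h][v] * posteriors[h] for h in posteriors),
        )

    print(bayes_optimal(["+", "-"], posteriors, predictions))  # "-" wins, 0.6 vs 0.4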
Gibbs Algorithm
- Choose a hypothesis h from H at random, according to
the posterior probability distribution over H
- Use h to predict the classification of the next instance x
- Under appropriate conditions (target concepts drawn at random according to the prior
assumed by the learner), the expected misclassification error of the Gibbs algorithm is
at most twice that of the Bayes optimal classifier; a minimal sketch of the sampling
step follows
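- A minimal sketch of the sampling step, reusing the hypothetical posteriors from the
previous sketch; each hypothesis here classifies every instance the same way, purely for
illustration.

    import random

    # Hypothetical posterior P(h | D) and per-hypothesis classifiers.
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    classify_with = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}

    def gibbs_classify(x):
        names = list(posteriors)
        # Draw one hypothesis according to the posterior, then let it classify x.
        h = random.choices(names, weights=[posteriors[n] for n in names], k=1)[0]
        return classify_with[h](x)

    print(gibbs_classify("some new instance"))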
Naive Bayes Classifier
- Examples must be described by conjunctions of attribute values
- The target values come from a finite set V
- Assume that all attribute values are conditionally independent,
given the target value
- v_MAP = argmax_{vj ∈ V} P(vj | a1, ..., an)
-       = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj)
- Under the conditional independence assumption, this becomes
  v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai | vj)
- Look at the example on pages 178-179
- Often, the conditional probabilities are estimated with the m-estimate
  (nc + m*p) / (n + m)
  where n is the number of training examples with the given target value, nc is the
  number of those examples that also have the given attribute value, p is the prior
  estimate of the probability (typically 1/k, where k is the number of possible values
  the attribute can take), and m is the equivalent sample size, i.e. the number of
  "virtual" samples added to the observations (see the sketch after this list)
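- A self-contained sketch of a naive Bayes learner over discrete attributes that uses
the m-estimate above for the conditional probabilities. The tiny weather-style data set,
the attribute layout, and the choice m = 2 are all invented for illustration.

    from collections import Counter, defaultdict

    def train_naive_bayes(examples, m=1.0):
        """examples: list of (attribute_tuple, target_value) pairs."""
        class_counts = Counter(v for _, v in examples)
        value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
        domains = defaultdict(set)            # attribute index -> observed values
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                value_counts[(i, v)][a] += 1
                domains[i].add(a)

        def p_class(v):
            return class_counts[v] / len(examples)

        def p_attr(i, a, v):
            n = class_counts[v]                 # examples with target value v
            n_c = value_counts[(i, v)][a]       # of those, examples with value a for attribute i
            p = 1.0 / len(domains[i])           # uniform prior estimate p = 1/k
            return (n_c + m * p) / (n + m)      # the m-estimate

        def classify(attrs):
            # v_NB = argmax_v P(v) * prod_i P(a_i | v)
            def score(v):
                s = p_class(v)
                for i, a in enumerate(attrs):
                    s *= p_attr(i, a, v)
                return s
            return max(class_counts, key=score)

        return classify

    # Usage with an invented toy data set: (Outlook, Wind) -> PlayTennis
    data = [(("sunny", "weak"), "no"), (("rain", "weak"), "yes"),
            (("overcast", "strong"), "yes"), (("sunny", "strong"), "no")]
    classify = train_naive_bayes(data, m=2.0)
    print(classify(("rain", "strong")))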
Naive Bayes Classifier Application: Learning to Classify Text
- Take a look at the algorithm in Table 6.2
- There were 20 possible newsgroups
- 1000 articles were collected from each newsgroup
- Two-thirds of the articles were used for training
- The 100 most commonly occurring words were removed from
the articles
- Words occurring fewer than three times were also removed
- The resulting vocabulary consisted of roughly 38,500 words
- The resulting classifier achieved 89% classification accuracy on the test data
(a compact sketch of the Table 6.2 procedure follows)
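- For concreteness, a compact sketch in the spirit of the Table 6.2 algorithm: estimate
P(vj) and P(wk | vj) = (nk + 1) / (n + |Vocabulary|) from the training documents, then
classify a new document by summing log probabilities over its word positions. The
two-document "corpus" and the class names below are placeholders, not the newsgroup data.

    import math
    from collections import Counter

    def learn_naive_bayes_text(documents):
        """documents: list of (list_of_words, class_label) pairs."""
        vocabulary = {w for words, _ in documents for w in words}
        classes = {label for _, label in documents}
        priors, cond = {}, {}
        for v in classes:
            docs_v = [words for words, label in documents if label == v]
            priors[v] = len(docs_v) / len(documents)
            text_v = [w for words in docs_v for w in words]   # all word positions for class v
            counts = Counter(text_v)
            n = len(text_v)
            # P(w | v) = (n_k + 1) / (n + |Vocabulary|)
            cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return vocabulary, priors, cond

    def classify_naive_bayes_text(doc_words, vocabulary, priors, cond):
        positions = [w for w in doc_words if w in vocabulary]  # ignore out-of-vocabulary words
        return max(priors, key=lambda v: math.log(priors[v])
                   + sum(math.log(cond[v][w]) for w in positions))

    # Placeholder "corpus" with two classes; the real experiment used 20 newsgroups.
    docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda", "notes"], "ham")]
    vocab, priors, cond = learn_naive_bayes_text(docs)
    print(classify_naive_bayes_text(["buy", "cheap", "pills"], vocab, priors, cond))  # "spam"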