Chapter 7: Computational Learning Theory
Sample Complexity for Infinite Hypothesis Spaces
- The Vapnik-Chervonenkis dimension of H, VC(H), typically allows us to
place a fairly tight bound on the sample complexity
- A set of instances S is shattered by hypothesis space H if and
only if for every dichotomy of S (2^|S| in total),
there exists some hypothesis in H consistent with that dichotomy.
Take a look at Figure 7.3.
- VC(H) of hypothesis space H defined over instance space X is the
size of the largest finite subset of X shattered by H
- Note: if an arbitrarily large finite set of X can be shattered
by H, then VC(H) ≡ ∞
- Note: for finite H, VC(H) ≤ log2|H|, since shattering d instances
requires at least 2^d distinct hypotheses
Examples
- If X = {real numbers} and H = {intervals on the real line} then
VC(H) = 2
- If X = {points on the x,y plane} and H = {linear decision surfaces}
then VC(H) = 3. See Figure 7.4.
- If X = {conjunctions of exactly 3 boolean literals} and
H = {conjunctions of up to 3 boolean literals} then VC(H) = 3
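To make shattering concrete, here is a minimal Python sketch (not from the
text; the function names are illustrative) that enumerates every dichotomy of
an instance set and checks whether the interval hypotheses above can realize
each one, confirming that two points on the real line can be shattered while
three cannot:

    from itertools import product

    def interval_consistent(points, labels):
        # True if some closed interval [a, b] labels exactly the positive
        # points as inside and the negative points as outside.
        positives = [x for x, y in zip(points, labels) if y]
        negatives = [x for x, y in zip(points, labels) if not y]
        if not positives:
            return True  # an empty interval realizes the all-negative dichotomy
        a, b = min(positives), max(positives)
        return all(not (a <= x <= b) for x in negatives)

    def shattered_by_intervals(points):
        # Check every dichotomy of the instance set (2^|S| of them).
        return all(interval_consistent(points, labels)
                   for labels in product([False, True], repeat=len(points)))

    print(shattered_by_intervals([1.0, 2.0]))       # True:  2 points shattered
    print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: VC(intervals) = 2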
Sample Complexity Bounds
- Upper bound (number of training examples sufficient for any consistent
learner): m ≥ (1/ε)[4 log2(2/δ) + 8 VC(H) log2(13/ε)]
- Lower bound. Consider any concept class C such that VC(C) ≥ 2,
any learner L, and any 0 < ε < 1/8 and
0 < δ < 1/100. Then there exists a distribution D
and target concept in C such that if L observes fewer examples
than max[(1/ε)log(1/δ), (VC(C) - 1)/(32ε)],
then with probability at least δ, L outputs a hypothesis h
having error_D(h) > ε
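As a quick illustration of the upper bound above, a small Python sketch that
plugs in hypothetical values (ε = 0.1, δ = 0.05, VC(H) = 3) to get a
sufficient number of training examples:

    from math import ceil, log2

    def vc_sample_bound(epsilon, delta, vc_dim):
        # m >= (1/eps)[4 log2(2/delta) + 8 VC(H) log2(13/eps)]
        return ceil((1.0 / epsilon) *
                    (4 * log2(2 / delta) + 8 * vc_dim * log2(13 / epsilon)))

    print(vc_sample_bound(epsilon=0.1, delta=0.05, vc_dim=3))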
The Mistake Bound Model of Learning
- How many mistakes will the learner make in its predictions before
it learns the target concept?
- The learner must predict c(x) before being told the answer
- This is useful in applications such as predicting fraudulent credit
card purchases
- In this section, we are interested in learning the target
concept exactly
Find-S
- Initialize h to the most specific hypothesis
a1 ⋀ ¬a1 ⋀ a2 ⋀ ¬a2 ⋀ ... ⋀ an ⋀ ¬an
- For each positive training instance x, remove from h any literal
that is not satisfied by x
- Output h
The largest number of mistakes Find-S can make before exactly learning a
target conjunction (its mistake bound) is n + 1: the first mistake on a
positive example removes n of the 2n initial literals, and each later
mistake removes at least one more.
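A minimal Python sketch of this mistake-bound version of Find-S (the
set-of-literals representation and the example data are my own illustration):

    def find_s(n, positive_examples):
        # Start with the most specific hypothesis a1 ∧ ¬a1 ∧ ... ∧ an ∧ ¬an,
        # represented as the set of all 2n literals (index, truth value).
        h = {(i, val) for i in range(n) for val in (True, False)}
        for x in positive_examples:  # each x is a tuple of n booleans
            # Remove from h every literal that x does not satisfy.
            h = {(i, val) for (i, val) in h if x[i] == val}
        return h

    # Hypothetical target a1 ∧ ¬a3 over n = 3 attributes
    print(sorted(find_s(3, [(True, True, False), (True, False, False)])))
    # -> [(0, True), (2, False)], i.e. a1 ∧ ¬a3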
Halving Algorithm
- Maintain the version space of hypotheses consistent with the examples
seen so far
- The goal is to reduce the number of viable hypotheses to 1
- Each classification is determined by a majority vote of the hypotheses
in the current version space
- The mistake bound is log2|H|, since every mistake at least halves the
version space
- Again, this is an upper bound; it is possible to learn the target
concept without making any mistakes
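A minimal Python sketch of the halving algorithm (hypotheses are represented
as plain callables, and the pool of threshold concepts is hypothetical):

    def halving_learner(hypotheses, stream):
        # Predict by majority vote over the version space, then discard
        # every hypothesis that disagrees with the revealed label.
        version_space = list(hypotheses)
        mistakes = 0
        for x, label in stream:
            votes = [h(x) for h in version_space]
            prediction = votes.count(True) > len(votes) / 2  # ties -> negative
            if prediction != label:
                mistakes += 1
            version_space = [h for h in version_space if h(x) == label]
        return version_space, mistakes

    # Hypothetical H: threshold concepts x >= t for t = 0..7; target is t = 5
    H = [lambda x, t=t: x >= t for t in range(8)]
    stream = [(x, x >= 5) for x in (3, 6, 4, 7, 5)]
    _, m = halving_learner(H, stream)
    print(m)  # number of mistakes; at most log2|H| = 3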
Optimal Mistake Bounds
- For any learning algorithm A and target concept c, let MA(c) denote
the maximum, over all possible sequences of training examples, of the
number of mistakes made by A to exactly learn c
- Let C be an arbitrary non-empty concept class, and let
MA(C) ≡ max over c ∈ C of MA(c)
- The optimal mistake bound for C, Opt(C), is the minimum over
all possible learning algorithms A of MA(C)
- VC(C) ≤ Opt(C) ≤ MHALVING(C) ≤ log2|C|
Weighted Majority Algorithm
- Table 7.1 shows the algorithm
- Theorem: Relative mistake bound for Weighted-Majority. Let D
be any sequence of training examples, let A be any set of n
prediction algorithms, and let k be the minimum number of mistakes
made by any algorithm in A for the training sequence D.
Then the number of mistakes over D made by the Weighted-Majority
algorithm using β = 1/2 is at most 2.4(k + log2 n)
- Note: the weighted-majority idea is the basis behind many ensemble
learners that use boosting (such as AdaBoost) to obtain better
classification accuracy
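A minimal Python sketch of Weighted-Majority with β = 1/2 (the expert pool
and the data below are hypothetical):

    def weighted_majority(predictors, stream, beta=0.5):
        # Predict by weighted vote; multiply by beta the weight of every
        # predictor that errs on the revealed label.
        weights = [1.0] * len(predictors)
        mistakes = 0
        for x, label in stream:
            pos = sum(w for w, p in zip(weights, predictors) if p(x))
            neg = sum(w for w, p in zip(weights, predictors) if not p(x))
            prediction = pos > neg  # ties broken toward negative
            if prediction != label:
                mistakes += 1
            weights = [w * beta if p(x) != label else w
                       for w, p in zip(weights, predictors)]
        return weights, mistakes

    # Hypothetical pool of experts predicting whether an integer is >= 5
    experts = [lambda x: x >= 3, lambda x: x >= 5, lambda x: x >= 7]
    stream = [(x, x >= 5) for x in (1, 6, 4, 8, 2, 5)]
    print(weighted_majority(experts, stream))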
Exercises