Chapter 7: Computational Learning Theory
Sample Complexity for Infinite Hypothesis Spaces
- The Vapnik-Chervonenkis dimension of H, VC(H), typically allows us to
place a fairly tight bound on the sample complexity
- A set of instances S is shattered by hypothesis space H if and
only if for every dichotomy of S (2^|S| in total),
there exists some hypothesis in H consistent with that dichotomy.
Take a look at Figure 7.3.
- VC(H) of hypothesis space H defined over instance space X is the
size of the largest finite subset of X shattered by H
- Note: if an arbitrarily large finite set of X can be shattered
by H, then VC(H) ≡ ∞
- Note: for finite H, VC(H) ≤ log2|H|, since shattering d instances
requires at least 2^d distinct hypotheses
Examples
- If X = {real numbers} and H = {intervals on the real line} then
VC(H) = 2
- If X = {points on the x,y plane} and H = {linear decision surfaces}
then VC(H) = 3. See Figure 7.4.
- If X = {conjunctions of exactly 3 boolean literals} and
H = {conjunctions of up to 3 boolean literals} then VC(H) = 3
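To make shattering concrete, here is a minimal Python sketch (not from the
text; the function names are illustrative) that enumerates every dichotomy of
an instance set and checks whether the interval hypotheses above can realize
each one, confirming that two points on the real line can be shattered while
three cannot:

    from itertools import product

    def interval_consistent(points, labels):
        # True if some closed interval [a, b] labels exactly the positive
        # points as inside and the negative points as outside.
        positives = [x for x, y in zip(points, labels) if y]
        negatives = [x for x, y in zip(points, labels) if not y]
        if not positives:
            return True  # an empty interval realizes the all-negative dichotomy
        a, b = min(positives), max(positives)
        return all(not (a <= x <= b) for x in negatives)

    def shattered_by_intervals(points):
        # Check every dichotomy of the instance set (2^|S| of them).
        return all(interval_consistent(points, labels)
                   for labels in product([False, True], repeat=len(points)))

    print(shattered_by_intervals([1.0, 2.0]))       # True:  2 points shattered
    print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: VC(intervals) = 2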
Sample Complexity Bounds
- Upper bound (number of training examples sufficient for any consistent
learner): m ≥ (1/ε)[4 log2(2/δ) + 8 VC(H) log2(13/ε)]
- Lower bound. Consider any concept class C such that VC(C) ≥ 2,
any learner L, and any 0 < ε < 1/8 and
0 < δ < 1/100. Then there exists a distribution D
and target concept in C such that if L observes fewer examples
than max[(1/ε)log(1/δ), (VC(C) - 1)/(32ε)],
then with probability at least δ, L outputs a hypothesis h
having error_D(h) > ε
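As a quick illustration of the upper bound above, a small Python sketch that
plugs in hypothetical values (ε = 0.1, δ = 0.05, VC(H) = 3) to get a
sufficient number of training examples:

    from math import ceil, log2

    def vc_sample_bound(epsilon, delta, vc_dim):
        # m >= (1/eps)[4 log2(2/delta) + 8 VC(H) log2(13/eps)]
        return ceil((1.0 / epsilon) *
                    (4 * log2(2 / delta) + 8 * vc_dim * log2(13 / epsilon)))

    print(vc_sample_bound(epsilon=0.1, delta=0.05, vc_dim=3))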
The Mistake Bound Model of Learning
- How many mistakes will the learner make in its predictions before
it learns the target concept?
- The learner must predict c(x) before being told the answer
- This is useful in applications such as predicting fraudulent credit
card purchases
- In this section, we are interested in learning the target
concept exactly
Find-S
- Initialize h to the most specific hypothesis
a1 ⋀ ¬a1 ⋀ a2 ⋀ ¬a2 ⋀ ... ⋀ an ⋀ ¬an
- For each positive training instance x, remove from h any literal
that is not satisfied by x
- Output h
The largest number of mistakes Find-S can make before exactly learning a
target conjunction (its mistake bound) is n + 1: the first mistake on a
positive example removes n of the 2n initial literals, and each later
mistake removes at least one more.
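A minimal Python sketch of this mistake-bound version of Find-S (the
set-of-literals representation and the example data are my own illustration):

    def find_s(n, positive_examples):
        # Start with the most specific hypothesis a1 ∧ ¬a1 ∧ ... ∧ an ∧ ¬an,
        # represented as the set of all 2n literals (index, truth value).
        h = {(i, val) for i in range(n) for val in (True, False)}
        for x in positive_examples:  # each x is a tuple of n booleans
            # Remove from h every literal that x does not satisfy.
            h = {(i, val) for (i, val) in h if x[i] == val}
        return h

    # Hypothetical target a1 ∧ ¬a3 over n = 3 attributes
    print(sorted(find_s(3, [(True, True, False), (True, False, False)])))
    # -> [(0, True), (2, False)], i.e. a1 ∧ ¬a3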
Halving Algorithm
- Maintain the version space of hypotheses consistent with the examples
seen so far
- The goal is to reduce the number of viable hypotheses to 1
- Each classification is determined by a majority vote of the hypotheses
in the current version space
- The mistake bound is log2|H|, since every mistake at least halves the
version space
- Again, this is an upper bound; it is possible to learn the target
concept without making any mistakes
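A minimal Python sketch of the halving algorithm (hypotheses are represented
as plain callables, and the pool of threshold concepts is hypothetical):

    def halving_learner(hypotheses, stream):
        # Predict by majority vote over the version space, then discard
        # every hypothesis that disagrees with the revealed label.
        version_space = list(hypotheses)
        mistakes = 0
        for x, label in stream:
            votes = [h(x) for h in version_space]
            prediction = votes.count(True) > len(votes) / 2  # ties -> negative
            if prediction != label:
                mistakes += 1
            version_space = [h for h in version_space if h(x) == label]
        return version_space, mistakes

    # Hypothetical H: threshold concepts x >= t for t = 0..7; target is t = 5
    H = [lambda x, t=t: x >= t for t in range(8)]
    stream = [(x, x >= 5) for x in (3, 6, 4, 7, 5)]
    _, m = halving_learner(H, stream)
    print(m)  # number of mistakes; at most log2|H| = 3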
Optimal Mistake Bounds
- For any learning algorithm A and target concept c, let MA(c) denote
the maximum, over all possible sequences of training examples, of the
number of mistakes made by A to exactly learn c
- Let C be an arbitrary non-empty concept class, and let
MA(C) ≡ max over c ∈ C of MA(c)
- The optimal mistake bound for C, Opt(C), is the minimum over
all possible learning algorithms A of MA(C)
- VC(C) ≤ Opt(C) ≤ MHALVING(C) ≤ log2|C|
Weighted Majority Algorithm
- Table 7.1 shows the algorithm
- Theorem: Relative mistake bound for Weighted-Majority. Let D
be any sequence of training examples, let A be any set of n
prediction algorithms, and let k be the minimum number of mistakes
made by any algorithm in A for the training sequence D.
Then the number of mistakes over D made by the Weighted-Majority
algorithm using β = 1/2 is at most 2.4(k + log2 n)
- Note: the weighted-majority idea is the basis behind many ensemble
learners that use boosting (such as AdaBoost) to obtain better
classification accuracy
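A minimal Python sketch of Weighted-Majority with β = 1/2 (the expert pool
and the data below are hypothetical):

    def weighted_majority(predictors, stream, beta=0.5):
        # Predict by weighted vote; multiply by beta the weight of every
        # predictor that errs on the revealed label.
        weights = [1.0] * len(predictors)
        mistakes = 0
        for x, label in stream:
            pos = sum(w for w, p in zip(weights, predictors) if p(x))
            neg = sum(w for w, p in zip(weights, predictors) if not p(x))
            prediction = pos > neg  # ties broken toward negative
            if prediction != label:
                mistakes += 1
            weights = [w * beta if p(x) != label else w
                       for w, p in zip(weights, predictors)]
        return weights, mistakes

    # Hypothetical pool of experts predicting whether an integer is >= 5
    experts = [lambda x: x >= 3, lambda x: x >= 5, lambda x: x >= 7]
    stream = [(x, x >= 5) for x in (1, 6, 4, 8, 2, 5)]
    print(weighted_majority(experts, stream))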
Exercises