Chapter 7: Computational Learning Theory
The goal of this chapter is to answer such questions as:
- Sample Complexity. How many training examples are needed for the
learner to converge (with high probability) to a successful hypothesis?
- Computational Complexity. How much computational effort is needed for
a learner to converge (with high probability) to a successful hypothesis?
- Mistake Bound. How many training examples will the learner misclassify
before converging to a successful hypothesis?
PAC Learning
- PAC: probably approximately correct
- The results are limited to learning boolean-valued concepts
from noise-free training data
- X: instance space
- C: set of target concepts
- D: probability distribution for generating instances from X
- L: learner
- H: set of hypotheses that learner considers
- True error: error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]
- Training error: fraction of training examples misclassified by h
- Observation: L can only observe the performance of h over
the training examples
- Issue: How probable is it that the observed training error for h
gives a misleading estimate of the true error?
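To make this issue concrete, here is a minimal simulation sketch; the target concept c, the hypothesis h, and the uniform distribution over 10-bit instances are illustrative assumptions, not taken from the text.

```python
import random

def c(x):                 # hypothetical target concept: x0 AND x1
    return x[0] == 1 and x[1] == 1

def h(x):                 # hypothetical (imperfect) hypothesis: x0
    return x[0] == 1

def draw(n_bits=10):      # instances drawn uniformly at random
    return [random.randint(0, 1) for _ in range(n_bits)]

random.seed(0)
train = [draw() for _ in range(20)]                      # small training sample
training_error = sum(c(x) != h(x) for x in train) / len(train)

test = [draw() for _ in range(100_000)]                  # large sample approximates error_D(h)
true_error = sum(c(x) != h(x) for x in test) / len(test)

print(training_error, true_error)  # the two can differ noticeably for small samples
```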
PAC Learnability
- We would like to find an h such that error_D(h) = 0.
This is not feasible because (1) unless every possible instance
of X appears in the training set, there may be multiple hypotheses
consistent with the training data, and (2) there is a small chance
that the randomly drawn training examples will be misleading
- Therefore, we will only require that error_D(h) < ε
- Further, we will require that the probability of failure
on a sequence of randomly drawn training examples be bounded by δ
- Definition. Consider a concept class C defined over a set of instances
X of length n and a learner L using hypothesis space H. C is
PAC-learnable
by L using H if for all c ∈ C, distributions D over X,
ε such that 0 < ε < 1/2 and δ such that
0 < δ < 1/2, learner L will with probability at least
(1 - δ) output a hypothesis h ∈ H such that
error_D(h) ≤ ε, in time that is polynomial
in 1/ε, 1/δ, n and size(c).
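Written compactly, the definition requires that

```latex
\forall c \in C,\ \forall D \text{ over } X,\ \forall \epsilon, \delta \in (0, \tfrac{1}{2}):\quad
\Pr\bigl[\operatorname{error}_D(h) \le \epsilon\bigr] \ge 1 - \delta
```

where h is the hypothesis output by L, and L must run in time polynomial in 1/ε, 1/δ, n, and size(c).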
Sample Complexity for Finite Hypothesis Spaces
- A learner is consistent if it outputs hypotheses
that perfectly fit the training data, whenever possible
- Definition. Consider a hypothesis space H, target concept c,
instance distribution D, and a set of training examples D of c.
The version space VS_{H,D} is said to be
ε-exhausted with respect to c and D if every
hypothesis h in VS_{H,D} has error less than ε
with respect to c and D:
(∀ h ∈ VS_{H,D}) error_D(h) < ε
- The above definition is illustrated in Figure 7.2
Theorem: ε-Exhausting the Version Space
- If the hypothesis space H is finite and D is a sequence of
m ≥ 1 independent randomly drawn examples of some target concept
c, then for any 0 ≤ ε ≤ 1, the probability that the
version space VS_{H,D} is not ε-exhausted (with
respect to c) is ≤ |H|e^(-εm)
Proof
- Let h_1, ..., h_k be all the hypotheses in H
that have true error > ε with respect to c.
- The probability that any single h_i is consistent
with one randomly drawn example is ≤ (1 - ε)
- The probability that h_i is consistent with all
m randomly drawn examples is ≤ (1 - ε)^m
- The probability that at least one of these k h_i is
consistent with all m training examples is ≤
k(1 - ε)^m
- Observation: k ≤ |H|
- Property: (1 - ε) ≤ e^(-ε)
- k(1 - ε)^m ≤ |H|e^(-εm)
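Chaining these steps gives the bound claimed in the theorem:

```latex
\Pr\bigl[\mathrm{VS}_{H,D}\ \text{not}\ \epsilon\text{-exhausted}\bigr]
  \;\le\; k(1-\epsilon)^m
  \;\le\; |H|\,(1-\epsilon)^m
  \;\le\; |H|\,e^{-\epsilon m}
```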
Practical Outcome
- We want |H|e^(-εm) ≤ δ
- Solving for m: m ≥ (1/ε)(ln|H| + ln(1/δ)) (evaluated in the sketch below)
- This estimate for m might be a substantial overestimate!
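A small helper (the function name and example values are illustrative) that evaluates this bound, rounding up because m must be a whole number of examples:

```python
import math

def sample_complexity(hypothesis_space_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 1000 hypotheses, epsilon = 0.05, delta = 0.01
print(sample_complexity(1000, 0.05, 0.01))
```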
Application - Conjunctions of Boolean Literals
- Consider the class C of target concepts described by
conjunctions of boolean literals.
- |H| = 3^n (each of the n literals may appear positively, appear
negated, or be absent from the conjunction)
- Thus, m ≥ (1/ε)(n ln 3 + ln(1/δ))
- This concept class is PAC-learnable.
- Theorem. The class C of conjunctions of boolean literals is
PAC-learnable by the FIND-S algorithm (sketched below) using H = C.
- If the target concept can be described by up to 10 boolean
literals (n = 10) and we want a 95% probability (δ = 0.05) that the
hypothesis will have error ≤ 0.1 (ε = 0.1), the bound shows
that 140 examples suffice.
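Below is a minimal sketch of a FIND-S-style consistent learner for this class, followed by the sample-size calculation quoted above; the hypothesis representation and names are illustrative assumptions.

```python
import math

def find_s(examples, n):
    """FIND-S-style learner for conjunctions of boolean literals.
    Each example is (x, label) with x a tuple of n 0/1 values; the hypothesis
    keeps one constraint per feature: 1, 0, "empty", or None ("don't care")."""
    h = ["empty"] * n                       # most specific hypothesis: matches nothing
    for x, label in examples:
        if not label:
            continue                        # FIND-S ignores negative examples
        for i, v in enumerate(x):
            if h[i] == "empty":
                h[i] = v                    # first positive example fixes each literal
            elif h[i] != v:
                h[i] = None                 # conflicting values => drop the literal
    return h

print(find_s([((1, 0, 1), True), ((1, 1, 1), True)], 3))              # [1, None, 1]

# Sample-size bound for n = 10, epsilon = 0.1, delta = 0.05:
n, epsilon, delta = 10, 0.1, 0.05
print(math.ceil((n * math.log(3) + math.log(1 / delta)) / epsilon))   # 140
```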
Application - Unbiased Learner
- If there are n boolean features, there are 2^n instances
- An unbiased H contains every possible concept over X, so |H| = 2^(2^n)
- m ≥ (1/ε)(2^n ln 2 + ln(1/δ)) (evaluated below)
- The required number of examples grows exponentially in n,
so this concept class is not PAC-learnable.
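A quick evaluation of this bound (ε = 0.1 and δ = 0.05 are illustrative values) shows how it blows up with the number of features:

```python
import math

epsilon, delta = 0.1, 0.05
for n in (5, 10, 15, 20):
    # m >= (1/epsilon) * (2^n * ln 2 + ln(1/delta)), since |H| = 2^(2^n)
    m = math.ceil((2 ** n * math.log(2) + math.log(1 / delta)) / epsilon)
    print(n, m)
# the number of required examples grows exponentially with n
```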
Agnostic Learning and Inconsistent Hypotheses
- Sometimes no hypothesis with zero training error exists in H
- Agnostic learner: one that makes no assumption that
the target concept is representable in H; it simply outputs
the hypothesis in H with minimum training error
- Using the Hoeffding bound, a derivation similar to the one above gives
- Pr[(∃ h ∈ H)(error_D(h) > error_train(h) + ε)] ≤ |H|e^(-2mε²)
- m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
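Evaluating the agnostic bound in the same setting as the conjunction example (|H| = 3^10, ε = 0.1, δ = 0.05; illustrative values) shows the cost of the 1/ε² dependence:

```python
import math

def agnostic_sample_complexity(hypothesis_space_size, epsilon, delta):
    """Smallest integer m with m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta))
                     / (2 * epsilon ** 2))

print(agnostic_sample_complexity(3 ** 10, 0.1, 0.05))   # about 5x the 140 needed by a consistent learner
```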
Exercises