Chapter 7: Computational Learning Theory
The goal of this chapter is to answer such questions as:
- Sample Complexity. How many training examples are needed for the
learner to converge (with high probability) to a successful hypothesis?
- Computational Complexity. How much computational effort is needed for
a learner to converge (with high probability) to a successful hypothesis?
- Mistake Bound. How many training examples will the learner misclassify
before converging to a successful hypothesis?
PAC Learning
- PAC: probably approximately correct
- The results are limited to learning boolean-valued concepts
from noise-free training data
- X: instance space
- C: set of target concepts
- D: probability distribution for generating instances from X
- L: learner
- H: set of hypotheses that learner considers
- True error: error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]
- Training error: fraction of training examples misclassified by h
- Observation: L can only observe the performance of h over
the training examples
- Issue: How probable is it that the observed training error for h
gives a misleading estimate of the true error?
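To make this issue concrete, here is a minimal simulation sketch; the target concept c, the hypothesis h, and the uniform distribution over 10-bit instances are illustrative assumptions, not taken from the text.

```python
import random

def c(x):                 # hypothetical target concept: x0 AND x1
    return x[0] == 1 and x[1] == 1

def h(x):                 # hypothetical (imperfect) hypothesis: x0
    return x[0] == 1

def draw(n_bits=10):      # instances drawn uniformly at random
    return [random.randint(0, 1) for _ in range(n_bits)]

random.seed(0)
train = [draw() for _ in range(20)]                      # small training sample
training_error = sum(c(x) != h(x) for x in train) / len(train)

test = [draw() for _ in range(100_000)]                  # large sample approximates error_D(h)
true_error = sum(c(x) != h(x) for x in test) / len(test)

print(training_error, true_error)  # the two can differ noticeably for small samples
```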
PAC Learnability
- We would like to find an h such that error_D(h) = 0.
This is not feasible because (1) unless every possible instance
of X appears in the training set, there may be multiple hypotheses
consistent with the training data, and (2) there is a small chance
that the randomly drawn training examples will be misleading
- Therefore, we will only require that error_D(h) < ε
- Further, we will require that the probability of failure
on a sequence of randomly drawn training examples be bounded by δ
- Definition. Consider a concept class C defined over a set of instances
X of length n and a learner L using hypothesis space H. C is
PAC-learnable
by L using H if for all c ∈ C, distributions D over X,
ε such that 0 < ε < 1/2 and δ such that
0 < δ < 1/2, learner L will with probability at least
(1 - δ) output a hypothesis h ∈ H such that
error_D(h) ≤ ε, in time that is polynomial
in 1/ε, 1/δ, n and size(c).
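Written compactly, the definition requires that

```latex
\forall c \in C,\ \forall D \text{ over } X,\ \forall \epsilon, \delta \in (0, \tfrac{1}{2}):\quad
\Pr\bigl[\operatorname{error}_D(h) \le \epsilon\bigr] \ge 1 - \delta
```

where h is the hypothesis output by L, and L must run in time polynomial in 1/ε, 1/δ, n, and size(c).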
Sample Complexity for Finite Hypothesis Spaces
- A learner is consistent if it outputs hypotheses
that perfectly fit the training data, whenever possible
- Definition. Consider a hypothesis space H, target concept c,
instance distribution D, and a set of training examples D of c.
The version space VS_{H,D} is said to be
ε-exhausted with respect to c and D if every
hypothesis h in VS_{H,D} has error less than ε
with respect to c and D:
(∀ h ∈ VS_{H,D}) error_D(h) < ε
- The above definition is illustrated in Figure 7.2
Theorem: ε-Exhausting the Version Space
- If the hypothesis space H is finite and D is a sequence of
m ≥ 1 independent randomly drawn examples of some target concept
c, then for any 0 ≤ ε ≤ 1, the probability that the
version space VS_{H,D} is not ε-exhausted (with
respect to c) is ≤ |H|e^(-εm)
Proof
- Let h_1, ..., h_k be all the hypotheses in H
that have true error > ε with respect to c.
- The probability that any single h_i is consistent
with one randomly drawn example is ≤ (1 - ε)
- The probability that h_i is consistent with all
m randomly drawn examples is ≤ (1 - ε)^m
- The probability that at least one of these k h_i is
consistent with all m training examples is ≤
k(1 - ε)^m
- Observation: k ≤ |H|
- Property: (1 - ε) ≤ e^(-ε)
- k(1 - ε)^m ≤ |H|e^(-εm)
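Chaining these steps gives the bound claimed in the theorem:

```latex
\Pr\bigl[\mathrm{VS}_{H,D}\ \text{not}\ \epsilon\text{-exhausted}\bigr]
  \;\le\; k(1-\epsilon)^m
  \;\le\; |H|\,(1-\epsilon)^m
  \;\le\; |H|\,e^{-\epsilon m}
```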
Practical Outcome
- We want |H|e^(-εm) ≤ δ
- Solving for m: m ≥ (1/ε)(ln|H| + ln(1/δ)) (evaluated in the sketch below)
- This estimate for m might be a substantial overestimate!
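A small helper (the function name and example values are illustrative) that evaluates this bound, rounding up because m must be a whole number of examples:

```python
import math

def sample_complexity(hypothesis_space_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 1000 hypotheses, epsilon = 0.05, delta = 0.01
print(sample_complexity(1000, 0.05, 0.01))
```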
Application - Conjunctions of Boolean Literals
- Consider the class C of target concepts described by
conjunctions of boolean literals.
- |H| = 3^n (each of the n literals may appear positively, appear
negated, or be absent from the conjunction)
- Thus, m ≥ (1/ε)(n ln 3 + ln(1/δ))
- This concept class is PAC-learnable.
- Theorem. The class C of conjunctions of boolean literals is
PAC-learnable by the FIND-S algorithm (sketched below) using H = C.
- If the target concept can be described by up to 10 boolean
literals (n = 10) and we want a 95% probability (δ = 0.05) that the
hypothesis will have error ≤ 0.1 (ε = 0.1), the bound shows
that 140 examples suffice.
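Below is a minimal sketch of a FIND-S-style consistent learner for this class, followed by the sample-size calculation quoted above; the hypothesis representation and names are illustrative assumptions.

```python
import math

def find_s(examples, n):
    """FIND-S-style learner for conjunctions of boolean literals.
    Each example is (x, label) with x a tuple of n 0/1 values; the hypothesis
    keeps one constraint per feature: 1, 0, "empty", or None ("don't care")."""
    h = ["empty"] * n                       # most specific hypothesis: matches nothing
    for x, label in examples:
        if not label:
            continue                        # FIND-S ignores negative examples
        for i, v in enumerate(x):
            if h[i] == "empty":
                h[i] = v                    # first positive example fixes each literal
            elif h[i] != v:
                h[i] = None                 # conflicting values => drop the literal
    return h

print(find_s([((1, 0, 1), True), ((1, 1, 1), True)], 3))              # [1, None, 1]

# Sample-size bound for n = 10, epsilon = 0.1, delta = 0.05:
n, epsilon, delta = 10, 0.1, 0.05
print(math.ceil((n * math.log(3) + math.log(1 / delta)) / epsilon))   # 140
```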
Application - Unbiased Learner
- If there are n boolean features, there are 2^n instances
- An unbiased H contains every possible concept over X, so |H| = 2^(2^n)
- m ≥ (1/ε)(2^n ln 2 + ln(1/δ)) (evaluated below)
- The required number of examples grows exponentially in n,
so this concept class is not PAC-learnable.
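A quick evaluation of this bound (ε = 0.1 and δ = 0.05 are illustrative values) shows how it blows up with the number of features:

```python
import math

epsilon, delta = 0.1, 0.05
for n in (5, 10, 15, 20):
    # m >= (1/epsilon) * (2^n * ln 2 + ln(1/delta)), since |H| = 2^(2^n)
    m = math.ceil((2 ** n * math.log(2) + math.log(1 / delta)) / epsilon)
    print(n, m)
# the number of required examples grows exponentially with n
```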
Agnostic Learning and Inconsistent Hypotheses
- Sometimes no hypothesis with zero training error exists in H
- Agnostic learner: one that makes no assumption that
the target concept is representable in H; it simply outputs
the hypothesis in H with minimum training error
- Using the Hoeffding bound, a derivation similar to the one above gives
- Pr[(∃ h ∈ H)(error_D(h) > error_train(h) + ε)] ≤ |H|e^(-2mε²)
- m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
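Evaluating the agnostic bound in the same setting as the conjunction example (|H| = 3^10, ε = 0.1, δ = 0.05; illustrative values) shows the cost of the 1/ε² dependence:

```python
import math

def agnostic_sample_complexity(hypothesis_space_size, epsilon, delta):
    """Smallest integer m with m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta))
                     / (2 * epsilon ** 2))

print(agnostic_sample_complexity(3 ** 10, 0.1, 0.05))   # about 5x the 140 needed by a consistent learner
```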
Exercises