Chapter 3: Decision Tree Learning
Inductive Bias
- Approximate inductive bias of ID3: shorter trees are preferred over
longer trees, and trees that place high-information-gain attributes
close to the root are preferred over those that do not.
- Occam's Razor: prefer the simplest hypothesis that fits the data.
- Preference or search bias: preference for certain hypotheses
over others with no hard restriction on the hypotheses that
can be enumerated. ID3 demonstrates a preference bias.
- Restriction or language bias: a categorical restriction on the
set of hypotheses considered. The candidate elimination algorithm
demonstrates a restriction bias.
- In general, a preference bias is more desirable than a restriction
bias, because the learner searches a complete hypothesis space that is
guaranteed to contain the target concept, whereas a restriction bias may
exclude it from the start. However, some learning systems combine both
biases.
Decision Tree Issues
1. Avoiding Overfitting the Data
- Definition: Given a hypothesis space H, a hypothesis h ∈ H
is said to overfit the training data if there exists some
alternative hypothesis h' ∈ H, such that h has smaller
error than h' over the training examples, but h' has a smaller
error than h over the entire distribution of instances.
- Figure 3.6 illustrates the problem of overfitting.
- Why might overfitting occur? The training data may contain noise, or
it may contain coincidental regularities that the tree fits,
particularly when the number of training examples is small.
- One solution is to stop growing the tree early.
- Another solution is to grow the full tree and then post-prune it.
This approach has been found to be more successful in practice.
- To decide when to stop growing or which nodes to prune under either
approach, split the available data into a training set and a
validation set.
Node Pruning (Reduced-Error Pruning)
- For each candidate node: remove the subtree rooted at that node, make
it a leaf, and assign it the most common classification of the training
examples affiliated with that node. Keep the change only if the
resulting tree performs no worse on the validation data.
- Figure 3.7 illustrates this process.
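A minimal sketch of this reduced-error pruning loop follows. The Node
representation and all names here are illustrative, not from the
chapter; examples are assumed to be dicts with a 'class' key.

```python
from dataclasses import dataclass, field

# Illustrative representation: a leaf has an empty `children` dict and
# classifies via `label`; an internal node tests `attribute` and stores
# `majority_class`, the most common class among its training examples.
@dataclass
class Node:
    attribute: str = None                         # attribute tested here
    children: dict = field(default_factory=dict)  # attribute value -> Node
    label: str = None                             # class label (leaves)
    majority_class: str = None                    # most common class here

def classify(node, example):
    while node.children:                          # descend to a leaf
        node = node.children[example[node.attribute]]
    return node.label

def accuracy(tree, examples):
    """Fraction of examples (dicts with a 'class' key) classified correctly."""
    return sum(classify(tree, x) == x["class"] for x in examples) / len(examples)

def reduced_error_prune(tree, node, validation):
    """Bottom-up: tentatively replace each internal node with a leaf
    labeled with its majority class; undo if validation accuracy drops."""
    for child in node.children.values():
        if child.children:
            reduced_error_prune(tree, child, validation)
    before = accuracy(tree, validation)
    saved_children, saved_label = node.children, node.label
    node.children, node.label = {}, node.majority_class   # tentative prune
    if accuracy(tree, validation) < before:               # worse: undo
        node.children, node.label = saved_children, saved_label
```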
Rule Pruning (C4.5)
- Build a decision tree.
- Convert the tree into an equivalent set of rules.
- Generalize each rule by removing any preconditions whose removal
improves its estimated accuracy.
- Sort the pruned rules by their estimated accuracy
and use them in this sequence when classifying subsequent
instances.
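A sketch of the pruning and sorting steps, assuming a rule is a
(preconditions, predicted_class) pair and that estimated accuracy is
measured on a held-out set; C4.5 itself uses a pessimistic estimate
computed from the training data instead. All names are illustrative.

```python
# A rule is (preconditions, predicted_class); preconditions is a list
# of (attribute, value) tests.

def matches(preconditions, example):
    return all(example[a] == v for a, v in preconditions)

def estimated_accuracy(rule, examples):
    """Accuracy of the rule over the examples it covers."""
    pre, cls = rule
    covered = [x for x in examples if matches(pre, x)]
    if not covered:
        return 0.0
    return sum(x["class"] == cls for x in covered) / len(covered)

def prune_rule(rule, examples):
    """Greedily drop any precondition whose removal improves accuracy."""
    pre, cls = rule
    improved = True
    while improved and pre:
        improved = False
        for i in range(len(pre)):
            candidate = (pre[:i] + pre[i + 1:], cls)
            if estimated_accuracy(candidate, examples) > \
               estimated_accuracy((pre, cls), examples):
                pre, improved = candidate[0], True
                break
    return (pre, cls)

def prune_and_sort(rules, examples):
    """Prune every rule, then sort by estimated accuracy (best first)."""
    pruned = [prune_rule(r, examples) for r in rules]
    return sorted(pruned, key=lambda r: estimated_accuracy(r, examples),
                  reverse=True)
```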
2. Incorporating Continuous-Valued Attributes
- A binary solution: sort the n examples by the attribute's value and
take the n-1 midpoints between adjacent values as candidate thresholds.
For each candidate value, compute the information gain of the boolean
test attribute ≤ value and keep the best threshold.
- The binary solution can be extended to multiple intervals.
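A sketch of the binary threshold search, using illustrative helper
names and the standard entropy definition. For example, with
Temperature values 40, 48, 60, 72, 80, 90 and labels No, No, Yes, Yes,
Yes, No, the classification changes only around the midpoints 54 and
85; a useful refinement (not shown) is to test only such midpoints.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Sort the n examples by value, then score the n-1 midpoints as
    candidate thresholds for the boolean test value <= t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None        # -1.0 so any real gain wins
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no midpoint between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```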
3. Alternative Measures for Selecting Attributes
- Problem: attributes with many distinct values split the training data
into many small, pure subsets, so information gain favors them even
though they generalize poorly. An extreme example is the attribute
date-and-time-of-event-in-milliseconds-since-year-0, which uniquely
identifies every training example.
- Solution: Modify information gain to penalize such attributes.
- SplitInformation(S, A) = - Σ(i=1..c) (|Si| / |S|) log2(|Si| / |S|),
where S1, ..., Sc are the subsets of S produced by the c values of
attribute A.
- GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
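A short sketch of these two formulas, assuming each partition is the
list of class labels for one value of A (names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):                      # same helper as above
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_information(partitions, total):
    """-Sum |Si|/|S| * log2(|Si|/|S|) over the subsets Si induced by A."""
    return -sum((len(p) / total) * log2(len(p) / total) for p in partitions)

def gain_ratio(labels, partitions):
    n = len(labels)
    gain = entropy(labels) - sum((len(p) / n) * entropy(p) for p in partitions)
    si = split_information(partitions, n)
    # SplitInformation is 0 when A takes a single value; guard against
    # division by zero (and against inflating the ratio when si is tiny).
    return gain / si if si > 0 else 0.0
```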
4. Handling Missing Attribute Values
- Solution: assign the missing value the most commonly occurring value
among the examples at this node (or among those with the same
classification).
- Solution: assign a probability to each possible value, based on the
observed frequencies at the node, and pass a fractional example down
each branch of the test. This is the method used in C4.5.
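Minimal sketches of both strategies, assuming examples are dicts and
None marks a missing value (these helper names are made up for
illustration):

```python
from collections import Counter

def fill_most_common(examples, attribute):
    """First strategy: replace missing values of `attribute` with the
    most common known value among the examples at this node."""
    known = [x[attribute] for x in examples if x[attribute] is not None]
    mode = Counter(known).most_common(1)[0][0]   # assumes some known values
    return [dict(x, **{attribute: mode}) if x[attribute] is None else x
            for x in examples]

def value_distribution(examples, attribute):
    """Second strategy: estimate P(value) from observed frequencies, so
    a fractional example can be passed down every branch of the test."""
    known = [x[attribute] for x in examples if x[attribute] is not None]
    return {v: c / len(known) for v, c in Counter(known).items()}
```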
5. Handling Attributes with Different Costs
- Observation: Obtaining a patient's temperature is less costly
than obtaining an MRI for the same patient.
- Idea (Tan and Schlimmer): choose the next attribute to maximize
Gain(S,A)^2 / Cost(A).
- Idea (Nunez): choose the next attribute to maximize
(2^Gain(S,A) - 1) / (Cost(A) + 1)^w, where w ∈ [0,1] is a constant that
determines the relative importance of cost versus information gain.
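Both heuristics reduce to one-line scoring functions; at each node the
attribute with the highest score is selected. A sketch, with
illustrative parameter names:

```python
def tan_schlimmer_score(gain, cost):
    """Gain(S,A)^2 / Cost(A): rewards informative but cheap attributes."""
    return gain ** 2 / cost

def nunez_score(gain, cost, w=0.5):
    """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w; w in [0,1] weights cost
    against information gain."""
    return (2 ** gain - 1) / (cost + 1) ** w
```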
Practice Exercises