Chapter 3: Decision Tree Learning
Inductive Bias
- Approximate inductive bias of ID3: shorter trees are preferred over
longer trees, and trees that place high-information-gain attributes
close to the root are preferred over those that do not.
- Occam's Razor: prefer the simplest hypothesis that fits the data.
- Preference or search bias: preference for certain hypotheses
over others with no hard restriction on the hypotheses that
can be enumerated. ID3 demonstrates a preference bias.
- Restriction or language bias: a categorical restriction on the
set of hypotheses considered. The candidate elimination algorithm
demonstrates a restriction bias.
- In general, a preference bias is more desirable than a restriction
bias, because the learner searches a complete hypothesis space that is
guaranteed to contain the target concept, whereas a restriction bias may
exclude it from the start. However, some learning systems combine both
biases.
Decision Tree Issues
1. Avoiding Overfitting the Data
- Definition: Given a hypothesis space H, a hypothesis h ∈ H
is said to overfit the training data if there exists some
alternative hypothesis h' ∈ H, such that h has smaller
error than h' over the training examples, but h' has a smaller
error than h over the entire distribution of instances.
- Figure 3.6 illustrates the problem of overfitting.
- Why might overfitting occur? The training data may contain noise, or
it may contain coincidental regularities that the tree fits,
particularly when the number of training examples is small.
- One solution is to stop growing the tree early.
- Another solution is to grow the full tree and then post-prune it.
This approach has been found to be more successful in practice.
- To decide when to stop growing or which nodes to prune under either
approach, split the available data into a training set and a
validation set.
Node Pruning (Reduced-Error Pruning)
- For each candidate node: remove the subtree rooted at that node, make
it a leaf, and assign it the most common classification of the training
examples affiliated with that node. Keep the change only if the
resulting tree performs no worse on the validation data.
- Figure 3.7 illustrates this process.
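A minimal sketch of this reduced-error pruning loop follows. The Node
representation and all names here are illustrative, not from the
chapter; examples are assumed to be dicts with a 'class' key.

```python
from dataclasses import dataclass, field

# Illustrative representation: a leaf has an empty `children` dict and
# classifies via `label`; an internal node tests `attribute` and stores
# `majority_class`, the most common class among its training examples.
@dataclass
class Node:
    attribute: str = None                         # attribute tested here
    children: dict = field(default_factory=dict)  # attribute value -> Node
    label: str = None                             # class label (leaves)
    majority_class: str = None                    # most common class here

def classify(node, example):
    while node.children:                          # descend to a leaf
        node = node.children[example[node.attribute]]
    return node.label

def accuracy(tree, examples):
    """Fraction of examples (dicts with a 'class' key) classified correctly."""
    return sum(classify(tree, x) == x["class"] for x in examples) / len(examples)

def reduced_error_prune(tree, node, validation):
    """Bottom-up: tentatively replace each internal node with a leaf
    labeled with its majority class; undo if validation accuracy drops."""
    for child in node.children.values():
        if child.children:
            reduced_error_prune(tree, child, validation)
    before = accuracy(tree, validation)
    saved_children, saved_label = node.children, node.label
    node.children, node.label = {}, node.majority_class   # tentative prune
    if accuracy(tree, validation) < before:               # worse: undo
        node.children, node.label = saved_children, saved_label
```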
Rule Pruning (C4.5)
- Build a decision tree.
- Convert the tree into an equivalent set of rules.
- Generalize each rule by removing any preconditions whose removal
improves its estimated accuracy.
- Sort the pruned rules by their estimated accuracy
and use them in this sequence when classifying subsequent
instances.
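A sketch of the pruning and sorting steps, assuming a rule is a
(preconditions, predicted_class) pair and that estimated accuracy is
measured on a held-out set; C4.5 itself uses a pessimistic estimate
computed from the training data instead. All names are illustrative.

```python
# A rule is (preconditions, predicted_class); preconditions is a list
# of (attribute, value) tests.

def matches(preconditions, example):
    return all(example[a] == v for a, v in preconditions)

def estimated_accuracy(rule, examples):
    """Accuracy of the rule over the examples it covers."""
    pre, cls = rule
    covered = [x for x in examples if matches(pre, x)]
    if not covered:
        return 0.0
    return sum(x["class"] == cls for x in covered) / len(covered)

def prune_rule(rule, examples):
    """Greedily drop any precondition whose removal improves accuracy."""
    pre, cls = rule
    improved = True
    while improved and pre:
        improved = False
        for i in range(len(pre)):
            candidate = (pre[:i] + pre[i + 1:], cls)
            if estimated_accuracy(candidate, examples) > \
               estimated_accuracy((pre, cls), examples):
                pre, improved = candidate[0], True
                break
    return (pre, cls)

def prune_and_sort(rules, examples):
    """Prune every rule, then sort by estimated accuracy (best first)."""
    pruned = [prune_rule(r, examples) for r in rules]
    return sorted(pruned, key=lambda r: estimated_accuracy(r, examples),
                  reverse=True)
```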
2. Incorporating Continuous-Valued Attributes
- A binary solution: sort the n examples by the attribute's value and
take the n-1 midpoints between adjacent values as candidate thresholds.
For each candidate value, compute the information gain of the boolean
test attribute ≤ value and keep the best threshold.
- The binary solution can be extended to multiple intervals.
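A sketch of the binary threshold search, using illustrative helper
names and the standard entropy definition. For example, with
Temperature values 40, 48, 60, 72, 80, 90 and labels No, No, Yes, Yes,
Yes, No, the classification changes only around the midpoints 54 and
85; a useful refinement (not shown) is to test only such midpoints.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Sort the n examples by value, then score the n-1 midpoints as
    candidate thresholds for the boolean test value <= t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None        # -1.0 so any real gain wins
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no midpoint between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```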
3. Alternative Measures for Selecting Attributes
- Problem: attributes with many distinct values split the training data
into many small, pure subsets, so information gain favors them even
though they generalize poorly. An extreme example is the attribute
date-and-time-of-event-in-milliseconds-since-year-0, which uniquely
identifies every training example.
- Solution: Modify information gain to penalize such attributes.
- SplitInformation(S, A) = - Σ(i=1..c) (|Si| / |S|) log2(|Si| / |S|),
where S1, ..., Sc are the subsets of S produced by the c values of
attribute A.
- GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
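A short sketch of these two formulas, assuming each partition is the
list of class labels for one value of A (names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):                      # same helper as above
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_information(partitions, total):
    """-Sum |Si|/|S| * log2(|Si|/|S|) over the subsets Si induced by A."""
    return -sum((len(p) / total) * log2(len(p) / total) for p in partitions)

def gain_ratio(labels, partitions):
    n = len(labels)
    gain = entropy(labels) - sum((len(p) / n) * entropy(p) for p in partitions)
    si = split_information(partitions, n)
    # SplitInformation is 0 when A takes a single value; guard against
    # division by zero (and against inflating the ratio when si is tiny).
    return gain / si if si > 0 else 0.0
```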
4. Handling Missing Attribute Values
- Solution: assign the missing value the most commonly occurring value
among the examples at this node (or among those with the same
classification).
- Solution: assign a probability to each possible value, based on the
observed frequencies at the node, and pass a fractional example down
each branch of the test. This is the method used in C4.5.
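Minimal sketches of both strategies, assuming examples are dicts and
None marks a missing value (these helper names are made up for
illustration):

```python
from collections import Counter

def fill_most_common(examples, attribute):
    """First strategy: replace missing values of `attribute` with the
    most common known value among the examples at this node."""
    known = [x[attribute] for x in examples if x[attribute] is not None]
    mode = Counter(known).most_common(1)[0][0]   # assumes some known values
    return [dict(x, **{attribute: mode}) if x[attribute] is None else x
            for x in examples]

def value_distribution(examples, attribute):
    """Second strategy: estimate P(value) from observed frequencies, so
    a fractional example can be passed down every branch of the test."""
    known = [x[attribute] for x in examples if x[attribute] is not None]
    return {v: c / len(known) for v, c in Counter(known).items()}
```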
5. Handling Attributes with Different Costs
- Observation: Obtaining a patient's temperature is less costly
than obtaining an MRI for the same patient.
- Idea (Tan and Schlimmer): choose the next attribute to maximize
Gain(S,A)^2 / Cost(A).
- Idea (Nunez): choose the next attribute to maximize
(2^Gain(S,A) - 1) / (Cost(A) + 1)^w, where w ∈ [0,1] is a constant that
determines the relative importance of cost versus information gain.
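Both heuristics reduce to one-line scoring functions; at each node the
attribute with the highest score is selected. A sketch, with
illustrative parameter names:

```python
def tan_schlimmer_score(gain, cost):
    """Gain(S,A)^2 / Cost(A): rewards informative but cheap attributes."""
    return gain ** 2 / cost

def nunez_score(gain, cost, w=0.5):
    """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w; w in [0,1] weights cost
    against information gain."""
    return (2 ** gain - 1) / (cost + 1) ** w
```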
Practice Exercises