Chapter 8: Instance-Based Learning
- Training examples are stored, so the information contained in these examples is never lost
- Typically, this is lazy (as opposed to eager) learning
because processing is delayed until a new instance must be classified
- A lazy learner constructs a different approximation to the target
function for each distinct query instead of forming an explicit hypothesis
Terminology
- Regression: approximating a real-valued target function
- Residual: the error f̂(x) - f(x) in approximating the target function
- Kernel Function: the function of distance that is used to determine the weight of each training example, w_i = K(d(x_i, x_q))
K Nearest Neighbor
- Uses Euclidean distance as the distance between two points
- For k = 1, the hypothesis space corresponds to a Voronoi diagram over the training instances; see Figure 8.1
- The inductive bias is to provide a classification that is most similar
to the classification of nearby instances
- A drawback is the curse of dimensionality: the distance metric uses all attributes, so many irrelevant attributes can swamp the few relevant ones
Discrete-Valued Target Function
- Classification: f̂(x_q) = argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(x_i)), where x_1 ... x_k are the k nearest neighbors and δ(a, b) = 1 if a = b, 0 otherwise
- Distance-weighted variation: f̂(x_q) = argmax_{v ∈ V} Σ_{i=1}^{k} w_i δ(v, f(x_i)), where w_i = d(x_q, x_i)^{-2} and x_q is the query instance; a sketch of both rules follows
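To make these two rules concrete, here is a minimal Python sketch; the function name knn_classify and the toy dataset are illustrative assumptions, not from the text:

    import numpy as np

    def knn_classify(X, y, x_q, k=3, weighted=False):
        """Classify x_q by a (optionally distance-weighted) k-nearest-neighbor vote."""
        d = np.linalg.norm(X - x_q, axis=1)         # Euclidean distance to every example
        nn = np.argsort(d)[:k]                      # indices of the k nearest neighbors
        votes = {}
        for i in nn:
            if weighted and d[i] == 0.0:
                return y[i]                         # query coincides with a training point
            w = d[i] ** -2 if weighted else 1.0     # w_i = d(x_q, x_i)^{-2}
            votes[y[i]] = votes.get(y[i], 0.0) + w  # accumulates w_i * delta(v, f(x_i))
        return max(votes, key=votes.get)            # argmax over candidate values v

    # hypothetical toy data: two well-separated 2-D classes
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
    y = np.array(["a", "a", "a", "b", "b", "b"])
    print(knn_classify(X, y, np.array([0.5, 0.5])))                 # "a"
    print(knn_classify(X, y, np.array([4.5, 5.0]), weighted=True))  # "b"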
Continuous-Valued Target Function
- Estimate: f̂(x_q) = (1/k) Σ_{i=1}^{k} f(x_i)
- Distance-weighted variation: f̂(x_q) = Σ_{i=1}^{k} w_i f(x_i) / Σ_{i=1}^{k} w_i, with w_i as above; a matching sketch follows
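A matching sketch for the continuous case; again the function name knn_regress and the data are hypothetical:

    import numpy as np

    def knn_regress(X, y, x_q, k=3, weighted=False):
        """Estimate f(x_q) as the mean (or weighted mean) of the k nearest f(x_i)."""
        d = np.linalg.norm(X - x_q, axis=1)
        nn = np.argsort(d)[:k]
        if not weighted:
            return y[nn].mean()                     # (1/k) * sum of f(x_i)
        if np.any(d[nn] == 0.0):
            return y[nn][d[nn] == 0.0].mean()       # exact match: use its value directly
        w = d[nn] ** -2                             # w_i = d(x_q, x_i)^{-2}
        return np.sum(w * y[nn]) / np.sum(w)        # sum w_i f(x_i) / sum w_i

    # hypothetical 1-D data sampled from f(x) = 2x
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 2.0, 4.0, 6.0])
    print(knn_regress(X, y, np.array([1.4]), k=2))                 # 3.0 (mean of 2 and 4)
    print(knn_regress(X, y, np.array([1.4]), k=2, weighted=True))  # about 2.62, pulled toward the nearer neighbor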
Locally Weighted Regression
- A generalization of k nearest neighbor
- Local: the function is approximated based on data near the query point
- Weighted: the contribution of each training example is weighted
by its distance from the query point
- The estimate is a linear model: f̂(x) = w_0 + w_1 a_1(x) + ... + w_n a_n(x), where a_i(x) denotes the i-th attribute of instance x
Global Variant
- Error: E = (1/2) Σ_{x ∈ D} (f(x) - f̂(x))^2
- Gradient descent update: Δw_j = η Σ_{x ∈ D} (f(x) - f̂(x)) a_j(x)
One Possible Local Variant
- Error: E(x_q) = (1/2) Σ_{x ∈ k nearest neighbors of x_q} (f(x) - f̂(x))^2 K(d(x_q, x))
- Gradient descent update: Δw_j = η Σ_{x ∈ k nearest neighbors of x_q} (f(x) - f̂(x)) a_j(x) K(d(x_q, x)); see the sketch below
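Below is a sketch of this local variant: gradient descent on the kernel-weighted error over the k nearest neighbors. The Gaussian kernel, step size, epoch count, and function name lwr_estimate are illustrative assumptions:

    import numpy as np

    def lwr_estimate(X, y, x_q, k=5, sigma=1.0, eta=0.1, epochs=500):
        """Fit w on the k nearest neighbors by minimizing the K-weighted
        squared error, then evaluate f_hat at the query point x_q."""
        d = np.linalg.norm(X - x_q, axis=1)
        nn = np.argsort(d)[:k]
        A = np.hstack([np.ones((k, 1)), X[nn]])     # a_0 = 1 supplies the intercept w_0
        K = np.exp(-d[nn] ** 2 / (2 * sigma ** 2))  # kernel weights K(d(x_q, x))
        w = np.zeros(A.shape[1])
        for _ in range(epochs):
            err = y[nn] - A @ w                     # f(x) - f_hat(x) for each neighbor
            w += eta * (A.T @ (K * err))            # delta w_j = eta * sum (f - f_hat) a_j K
        return np.array([1.0, *x_q]) @ w            # local linear model evaluated at x_q

    # hypothetical noisy samples of f(x) = x^2, queried at x = 1.5
    rng = np.random.default_rng(0)
    X = np.linspace(0, 3, 20).reshape(-1, 1)
    y = X[:, 0] ** 2 + rng.normal(0, 0.1, 20)
    print(lwr_estimate(X, y, np.array([1.5])))      # roughly 2.25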
Radial Basis Functions
- RBF networks are the only example of an eager learner in today's material
- Provides a global approximation to the target function,
represented by a linear combination of many local kernel functions
- Each of the k kernel functions is a local approximation
to the target function
- The value for a given kernel function is non-negligible only when
input x falls into the region defined by the kernel's center and width
- f̂(x) = w_0 + Σ_{μ=1}^{k} w_μ K_μ(d(x_μ, x)), where each x_μ ∈ X
- Typically K_μ is the Gaussian function, K_μ(d(x_μ, x)) = e^{-d^2(x_μ, x) / (2σ_μ^2)}, where x_μ is the center and σ_μ^2 is the variance
- Take a look at Figure 8.2
- An advantage of RBF networks is that the kernel layer and the output layer can be set up separately. First, k can be chosen and each x_μ and σ_μ^2 given a value. Second, the weights w_μ can be trained by minimizing E = (1/2) Σ_{x ∈ D} (f(x) - f̂(x))^2.
- RBFs have been successfully used to interpret visual scenes
- RBF networks can approximate any function with arbitrarily small error, given a large enough k and provided that each kernel width σ_μ^2 can be specified separately (a two-stage training sketch follows this list)
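A minimal two-stage sketch, assuming Gaussian kernels, evenly spaced centers, and a shared width (one simple choice among those the text allows); stage two minimizes E by linear least squares, since f̂ is linear in the weights. The names design_matrix, rbf_fit, and rbf_predict are hypothetical:

    import numpy as np

    def design_matrix(X, centers, sigma):
        """Gaussian kernel activations K_mu(d(x_mu, x)), plus a column of 1s for w_0."""
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return np.hstack([np.ones((len(X), 1)), np.exp(-d ** 2 / (2 * sigma ** 2))])

    def rbf_fit(X, y, centers, sigma=1.0):
        """Stage two: with centers and sigma fixed, choose w to minimize
        E = (1/2) * sum (f(x) - f_hat(x))^2, a linear least-squares problem."""
        w, *_ = np.linalg.lstsq(design_matrix(X, centers, sigma), y, rcond=None)
        return w

    def rbf_predict(Xq, centers, w, sigma=1.0):
        return design_matrix(Xq, centers, sigma) @ w  # f_hat = w_0 + sum w_mu K_mu

    # hypothetical target f(x) = sin(x); stage one: k = 5 evenly spaced centers
    X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
    y = np.sin(X[:, 0])
    centers = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
    w = rbf_fit(X, y, centers)
    print(rbf_predict(np.array([[np.pi / 2]]), centers, w))  # close to 1.0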
Case Based Reasoning
- Instances are not represented as real-valued points in an n-dimensional space; CBR typically uses symbolic descriptions instead
- Multiple retrieved cases may be combined to form the solution; this is typically knowledge intensive and might involve search
- It is difficult to know how to identify similar cases (a toy illustration follows this list)
- Figure 8.3 shows one entry in the CADET system's case library
and a problem to solve
- One application of CBR is to perform legal reasoning
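Because identifying similar cases is the hard part, here is a deliberately simple toy illustration: cases are attribute-value dictionaries and similarity just counts shared pairs. This is a hypothetical scheme, not CADET's actual matching method:

    def similarity(case_a, case_b):
        """Hypothetical similarity: count attribute-value pairs the two symbolic
        descriptions share (a real CBR system uses richer, domain-specific matching)."""
        return sum(1 for attr, val in case_a.items() if case_b.get(attr) == val)

    def retrieve(library, query, n=2):
        """Return the n stored cases most similar to the query description."""
        return sorted(library, key=lambda c: similarity(c["problem"], query), reverse=True)[:n]

    # hypothetical case library of mechanical-design problems
    library = [
        {"problem": {"function": "regulate-flow", "medium": "water"}, "solution": "valve"},
        {"problem": {"function": "regulate-flow", "medium": "air"},   "solution": "damper"},
        {"problem": {"function": "store",         "medium": "water"}, "solution": "tank"},
    ]
    query = {"function": "regulate-flow", "medium": "water"}
    for case in retrieve(library, query):
        print(case["solution"])  # most similar stored solutions: valve, then damper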
Exercises