Chapter 4: Artificial Neural Networks
Applications
- Handwriting recognition
- Face recognition
- Speech recognition
- Identifying fraudulent credit card transactions
Problem Characteristics
- Instances are represented by attribute-value pairs that
can be discrete or continuous
- The target function can be discrete-valued, real-valued or
vector-valued
- The training examples can be noisy (misclassified) or contain
  missing attribute values
- A lengthy training time is acceptable
- A fast target function evaluation is desirable
- Humans don't need to understand the target function - neural
networks are "opaque"
Human Brain
- Approximately 10^11 neurons
- Approximately 10^4 connections per neuron
- A neuron can fire in the time frame of 10^-3 seconds
- Parallel
- Distributed
ALVINN
- Figure 4.1
- There is a 30 by 32 grid of pixel inputs
- There are 4 hidden units
- There are 30 output units, representing actions from "sharp left" to
  "sharp right" (a forward-pass sketch of a network with this shape
  follows this list)
- ALVINN Website
- In 2005, Stanford's Stanley finished first in the DARPA Grand
  Challenge, a 132-mile desert race
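- As a rough illustration only (not ALVINN's actual implementation), a
  forward pass through a network of this shape could be sketched as
  follows in Python; the sigmoid units, weight values, and input values
  here are all assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical network with ALVINN's reported shape: 960 inputs
    # (30 by 32 pixel grid), 4 hidden units, 30 output units.
    rng = np.random.default_rng(0)
    W_hidden = rng.normal(scale=0.05, size=(4, 960))   # hidden-layer weights
    W_output = rng.normal(scale=0.05, size=(30, 4))    # output-layer weights

    image = rng.random(960)              # stand-in for a 30x32 camera image
    hidden = sigmoid(W_hidden @ image)   # 4 hidden activations
    steering = sigmoid(W_output @ hidden)

    # The most active output unit is taken as the steering command,
    # from "sharp left" (index 0) to "sharp right" (index 29).
    print(int(np.argmax(steering)))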
General Representation
- A weighted graph
- A learning algorithm (such as backpropagation) to modify the weights
Perceptron (Single Unit)
- Figure 4.2 shows a schematic
- Let w_i be a weight
- Let x_i be an input (-1 or 1)
- Let w_0 serve as the threshold weight and
  x_0 always be 1
- The output of a perceptron is computed as follows:
  o = 1 if Σ_{i=0..n} w_i * x_i > 0, else -1
- The hypothesis space is the set of all (n+1)-dimensional real-valued
  weight vectors
- A perceptron can represent any linearly separable concept
- Examples of linearly separable concepts: and, or, nand, nor,
m-of-n
- Example of a non-linearly separable concept: xor
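- As a small illustration of the output rule above, here is a Python
  sketch (the weight values are hand-picked by me, not from the text)
  that realizes the and concept over -1/1 inputs:

    import numpy as np

    def perceptron_output(weights, x):
        """Threshold the weighted sum; x[0] is the constant input x_0 = 1."""
        return 1 if np.dot(weights, x) > 0 else -1

    # Hand-picked weights realizing and: w_0 (threshold weight), w_1, w_2.
    w_and = np.array([-0.8, 0.5, 0.5])

    for x1 in (-1, 1):
        for x2 in (-1, 1):
            x = np.array([1, x1, x2])   # x_0 is always 1
            print(x1, x2, '->', perceptron_output(w_and, x))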
Perceptron Training Rule
- Guaranteed to converge provided that the learning rate
  (η) is sufficiently small and that the training examples
  are linearly separable
- Let t be the target classification
- Let o be the actual classification
- A typical value for η is 0.1
Algorithm:
1. Initialize the weights to small random values
2. Iterate through the training examples
3. If an example is misclassified, update each weight:
   Δw_i = η (t - o) x_i, then w_i = w_i + Δw_i
4. Go to step 2 if any example was misclassified
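- A compact Python sketch of this rule (function and variable names are
  my own), trained here on the or concept, which is linearly separable:

    import numpy as np

    def train_perceptron(examples, eta=0.1, max_epochs=100, seed=0):
        """examples: list of (x, t) where x includes the constant x_0 = 1
        and t is in {-1, 1}. Returns the learned weight vector."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=len(examples[0][0]))
        for _ in range(max_epochs):
            errors = 0
            for x, t in examples:
                o = 1 if np.dot(w, x) > 0 else -1
                if o != t:                               # update only on a mistake
                    w += eta * (t - o) * np.asarray(x, dtype=float)
                    errors += 1
            if errors == 0:                              # every example correct
                break
        return w

    # or over {-1, 1} inputs; each x begins with the constant input x_0 = 1.
    data = [((1, -1, -1), -1), ((1, -1, 1), 1), ((1, 1, -1), 1), ((1, 1, 1), 1)]
    print(train_perceptron(data))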
Delta Rule (Gradient Descent)
- Converges to a best-fit approximation even if the training
  examples are not linearly separable
- No threshold is applied; the unit's output is the unthresholded
  linear combination o = Σ_i w_i * x_i
- Figure 4.4 illustrates the notion of gradient descent
- The error is defined to be
  E(w) = 1/2 * Σ_d (t_d - o_d)^2, summed over all training examples d
- This rule serves as the basis for the backpropagation algorithm
- A potential drawback is that convergence can be slow
- Another potential drawback is that, when the error surface has
  multiple local minima (as in multilayer networks; a single linear
  unit's error surface has only one minimum), gradient descent is not
  guaranteed to find the global minimum, just a local one
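- The weight update in the algorithm below comes from differentiating
  E with respect to each weight; a brief derivation sketch (in LaTeX
  notation; x_{id} is the i-th input of training example d and
  o_d = Σ_i w_i x_{id}):

    \frac{\partial E}{\partial w_i}
      = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2
      = \sum_{d} (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right)
      = -\sum_{d} (t_d - o_d)\, x_{id}

  Stepping opposite the gradient gives Δw_i = η Σ_d (t_d - o_d) x_{id},
  which the algorithm below accumulates one example at a time.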
Algorithm:
- Initialize each w_i to a small random value
- Until the termination criterion is met:
  - Initialize each Δw_i to zero
  - For each training example, calculate its output o and then
    update Δw_i = Δw_i + η (t - o) x_i for each weight
  - Update w_i = w_i + Δw_i for each weight
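- A minimal Python sketch of this batch procedure for a single
  unthresholded linear unit (names and toy data are my own; a fixed
  epoch count stands in for the termination criterion):

    import numpy as np

    def gradient_descent(X, t, eta=0.01, epochs=300, seed=0):
        """X: one row per training example (first column is x_0 = 1),
        t: real-valued targets. Returns the learned weight vector."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])
        for _ in range(epochs):
            delta_w = np.zeros_like(w)           # accumulate over all examples
            for x, target in zip(X, t):
                o = np.dot(w, x)                 # unthresholded linear output
                delta_w += eta * (target - o) * x
            w += delta_w                         # one update per pass (batch)
        return w

    # Toy data: targets generated by a known linear rule plus a little noise.
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(20), rng.uniform(-1, 1, size=(20, 2))])
    t = X @ np.array([0.3, 1.5, -2.0]) + rng.normal(scale=0.05, size=20)
    print(gradient_descent(X, t))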
Stochastic Gradient Descent
- Like gradient descent, but the weights are updated incrementally
  after each training example rather than after a full pass through
  the training set
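- The incremental variant simply moves the weight update inside the
  per-example loop; a sketch mirroring the batch one above (again,
  names and toy data are my own):

    import numpy as np

    def stochastic_gradient_descent(X, t, eta=0.01, epochs=300, seed=0):
        """Per-example (incremental) version of the delta rule sketch."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])
        for _ in range(epochs):
            for x, target in zip(X, t):
                o = np.dot(w, x)                 # output for this one example
                w += eta * (target - o) * x      # update immediately
        return w

    # Same toy data shape as the batch sketch: x_0 = 1 plus two random inputs.
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(20), rng.uniform(-1, 1, size=(20, 2))])
    t = X @ np.array([0.3, 1.5, -2.0]) + rng.normal(scale=0.05, size=20)
    print(stochastic_gradient_descent(X, t))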
Perceptron (Multiple Unit)
- Can represent non-linearly separable concepts
- There is no known learning algorithm for such multi-unit networks of
  threshold units that is guaranteed to converge
- This provides the motivation for studying neural networks and
the backpropagation algorithm