Chapter 4: Artificial Neural Networks
Applications
- Handwriting recognition
- Face recognition
- Speech recognition
- Identifying fraudulent credit card transactions
Problem Characteristics
- Instances are represented by attribute-value pairs that
can be discrete or continuous
- The target function can be discrete-valued, real-valued or
vector-valued
- The training examples can be noisy (misclassified) or contain
  missing attribute values
- A lengthy training time is acceptable
- A fast target function evaluation is desirable
- Humans don't need to understand the target function - neural
networks are "opaque"
Human Brain
- Approximately 10^11 neurons
- Approximately 10^4 connections per neuron
- A neuron can fire in the time frame of 10^-3 seconds
- Parallel
- Distributed
ALVINN
- Figure 4.1
- There is a 30 by 32 grid of pixel inputs
- There are 4 hidden units
- There are 30 output units, representing actions from "sharp left" to
  "sharp right" (a forward-pass sketch of a network with this shape
  follows this list)
- ALVINN Website
- In 2005, Stanford's Stanley finished first in the DARPA Grand
  Challenge, a 132-mile desert race
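- As a rough illustration only (not ALVINN's actual implementation), a
  forward pass through a network of this shape could be sketched as
  follows in Python; the sigmoid units, weight values, and input values
  here are all assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical network with ALVINN's reported shape: 960 inputs
    # (30 by 32 pixel grid), 4 hidden units, 30 output units.
    rng = np.random.default_rng(0)
    W_hidden = rng.normal(scale=0.05, size=(4, 960))   # hidden-layer weights
    W_output = rng.normal(scale=0.05, size=(30, 4))    # output-layer weights

    image = rng.random(960)              # stand-in for a 30x32 camera image
    hidden = sigmoid(W_hidden @ image)   # 4 hidden activations
    steering = sigmoid(W_output @ hidden)

    # The most active output unit is taken as the steering command,
    # from "sharp left" (index 0) to "sharp right" (index 29).
    print(int(np.argmax(steering)))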
General Representation
- A weighted graph
- A learning algorithm (such as backpropagation) to modify the weights
Perceptron (Single Unit)
- Figure 4.2 shows a schematic
- Let w_i be a weight
- Let x_i be an input (-1 or 1)
- Let w_0 serve as the threshold weight and
  x_0 always be 1
- The output of a perceptron is computed as follows:
  o = 1 if Σ_{i=0..n} w_i * x_i > 0, else -1
- The hypothesis space is the set of all (n+1)-dimensional real-valued
  weight vectors
- A perceptron can represent any linearly separable concept
- Examples of linearly separable concepts: and, or, nand, nor,
m-of-n
- Example of a non-linearly separable concept: xor
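- As a small illustration of the output rule above, here is a Python
  sketch (the weight values are hand-picked by me, not from the text)
  that realizes the and concept over -1/1 inputs:

    import numpy as np

    def perceptron_output(weights, x):
        """Threshold the weighted sum; x[0] is the constant input x_0 = 1."""
        return 1 if np.dot(weights, x) > 0 else -1

    # Hand-picked weights realizing and: w_0 (threshold weight), w_1, w_2.
    w_and = np.array([-0.8, 0.5, 0.5])

    for x1 in (-1, 1):
        for x2 in (-1, 1):
            x = np.array([1, x1, x2])   # x_0 is always 1
            print(x1, x2, '->', perceptron_output(w_and, x))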
Perceptron Training Rule
- Guaranteed to converge provided that the learning rate
  (η) is sufficiently small and that the training examples
  are linearly separable
- Let t be the target classification
- Let o be the actual classification
- A typical value for η is 0.1
Algorithm:
1. Initialize the weights to small random values
2. Iterate through the training examples
3. If an example is misclassified, update each weight:
   Δw_i = η (t - o) x_i, then w_i = w_i + Δw_i
4. Go to step 2 if any example was misclassified
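- A compact Python sketch of this rule (function and variable names are
  my own), trained here on the or concept, which is linearly separable:

    import numpy as np

    def train_perceptron(examples, eta=0.1, max_epochs=100, seed=0):
        """examples: list of (x, t) where x includes the constant x_0 = 1
        and t is in {-1, 1}. Returns the learned weight vector."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=len(examples[0][0]))
        for _ in range(max_epochs):
            errors = 0
            for x, t in examples:
                o = 1 if np.dot(w, x) > 0 else -1
                if o != t:                               # update only on a mistake
                    w += eta * (t - o) * np.asarray(x, dtype=float)
                    errors += 1
            if errors == 0:                              # every example correct
                break
        return w

    # or over {-1, 1} inputs; each x begins with the constant input x_0 = 1.
    data = [((1, -1, -1), -1), ((1, -1, 1), 1), ((1, 1, -1), 1), ((1, 1, 1), 1)]
    print(train_perceptron(data))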
Delta Rule (Gradient Descent)
- Converges to a best-fit approximation even if the training
  examples are not linearly separable
- No threshold is applied; the unit's output is the unthresholded
  linear combination o = Σ_i w_i * x_i
- Figure 4.4 illustrates the notion of gradient descent
- The error is defined to be
  E(w) = 1/2 * Σ_d (t_d - o_d)^2, summed over all training examples d
- This rule serves as the basis for the backpropagation algorithm
- A potential drawback is that convergence can be slow
- Another potential drawback is that, when the error surface has
  multiple local minima (as in multilayer networks; a single linear
  unit's error surface has only one minimum), gradient descent is not
  guaranteed to find the global minimum, just a local one
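- The weight update in the algorithm below comes from differentiating
  E with respect to each weight; a brief derivation sketch (in LaTeX
  notation; x_{id} is the i-th input of training example d and
  o_d = Σ_i w_i x_{id}):

    \frac{\partial E}{\partial w_i}
      = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2
      = \sum_{d} (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right)
      = -\sum_{d} (t_d - o_d)\, x_{id}

  Stepping opposite the gradient gives Δw_i = η Σ_d (t_d - o_d) x_{id},
  which the algorithm below accumulates one example at a time.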
Algorithm:
- Initialize each w_i to a small random value
- Until the termination criterion is met:
  - Initialize each Δw_i to zero
  - For each training example, calculate its output o and then
    update Δw_i = Δw_i + η (t - o) x_i for each weight
  - Update w_i = w_i + Δw_i for each weight
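- A minimal Python sketch of this batch procedure for a single
  unthresholded linear unit (names and toy data are my own; a fixed
  epoch count stands in for the termination criterion):

    import numpy as np

    def gradient_descent(X, t, eta=0.01, epochs=300, seed=0):
        """X: one row per training example (first column is x_0 = 1),
        t: real-valued targets. Returns the learned weight vector."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])
        for _ in range(epochs):
            delta_w = np.zeros_like(w)           # accumulate over all examples
            for x, target in zip(X, t):
                o = np.dot(w, x)                 # unthresholded linear output
                delta_w += eta * (target - o) * x
            w += delta_w                         # one update per pass (batch)
        return w

    # Toy data: targets generated by a known linear rule plus a little noise.
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(20), rng.uniform(-1, 1, size=(20, 2))])
    t = X @ np.array([0.3, 1.5, -2.0]) + rng.normal(scale=0.05, size=20)
    print(gradient_descent(X, t))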
Stochastic Gradient Descent
- Like gradient descent, but the weights are updated incrementally
  after each training example rather than after a full pass through
  the training set
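- The incremental variant simply moves the weight update inside the
  per-example loop; a sketch mirroring the batch one above (again,
  names and toy data are my own):

    import numpy as np

    def stochastic_gradient_descent(X, t, eta=0.01, epochs=300, seed=0):
        """Per-example (incremental) version of the delta rule sketch."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])
        for _ in range(epochs):
            for x, target in zip(X, t):
                o = np.dot(w, x)                 # output for this one example
                w += eta * (target - o) * x      # update immediately
        return w

    # Same toy data shape as the batch sketch: x_0 = 1 plus two random inputs.
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(20), rng.uniform(-1, 1, size=(20, 2))])
    t = X @ np.array([0.3, 1.5, -2.0]) + rng.normal(scale=0.05, size=20)
    print(stochastic_gradient_descent(X, t))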
Perceptron (Multiple Unit)
- Can represent non-linearly separable concepts
- There is no known learning algorithm for such multi-unit networks of
  threshold units that is guaranteed to converge
- This provides the motivation for studying neural networks and
the backpropagation algorithm