Chapter 4: Artificial Neural Networks
Multilayer Networks
- Can represent highly nonlinear decision surfaces. See Figure 4.5.
- Desire a node that has a nonlinear output, yet is differentiable.
One solution is to use a sigmoid threshold unit where the output,
o = σ(net) = 1 / (1 + e^(-net))
and net = Σ_i w_i * x_i
- Note that σ'(y) = σ(y) * (1 - σ(y))
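The sigmoid unit and its derivative identity can be sketched directly (a minimal illustration; the example weights and inputs are made up, not from the text):

```python
import math

def sigmoid(net):
    """Sigmoid threshold unit: maps any real net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_deriv(y):
    """Derivative written in terms of the output y = sigmoid(net):
    sigma'(net) = y * (1 - y)."""
    return y * (1.0 - y)

# net = sum of w_i * x_i over the unit's inputs (example values)
w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, -1.0]
net = sum(wi * xi for wi, xi in zip(w, x))
o = sigmoid(net)
```

The identity σ'(y) = σ(y)(1 − σ(y)) is what makes backpropagation's error terms cheap to compute: the derivative falls out of the unit's output with one multiply.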
Backpropagation
- Uses gradient descent to minimize the squared error between
the network output values and the target values.
- Although backpropagation can settle into a local minimum of the
  error surface, in practice it produces very good results.
- Table 4.2 shows the algorithm.
- Notice that weights are updated incrementally.
- Thousands of iterations might be needed in practice!
- What are some reasonable termination criteria?
- Adding momentum. α is the momentum constant.
Δw_ji(n) = η * δ_j * x_ji + α * Δw_ji(n-1)
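The momentum update above can be sketched as a small function (a sketch only; η, α, the error term δ_j, and the input x_ji are assumed to come from a backpropagation pass, with example values chosen here):

```python
eta = 0.1    # learning rate (example value)
alpha = 0.9  # momentum constant (example value)

def weight_delta(delta_j, x_ji, prev_delta_w):
    """Delta_w_ji(n) = eta * delta_j * x_ji + alpha * Delta_w_ji(n-1).
    The alpha term carries a fraction of the previous update forward,
    which smooths the trajectory and can roll through small local dips."""
    return eta * delta_j * x_ji + alpha * prev_delta_w

# Two consecutive updates for one weight: the second step is larger
# because momentum adds a fraction of the first step.
d1 = weight_delta(1.0, 1.0, 0.0)  # first step, no history
d2 = weight_delta(1.0, 1.0, d1)   # second step, momentum kicks in
```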
Representational Power
- Boolean functions - need two layers (not including input layer)
- Continuous functions - need two layers
- Arbitrary functions - need three layers
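As a minimal illustration of the two-layer claim for Boolean functions, here is a hand-wired network computing XOR. Step-threshold units are used for clarity instead of sigmoids, and the weights are chosen by hand, not learned:

```python
def step(net):
    """Step threshold unit: fires (1) when net input is positive."""
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    """Two-layer network (one hidden layer, one output layer) for XOR."""
    # Hidden layer: one unit computes OR, the other NAND.
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # OR(x1, x2)
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)  # NAND(x1, x2)
    # Output layer: AND of the hidden units yields XOR.
    return step(1.0 * h1 + 1.0 * h2 - 1.5)
```

XOR is the classic function a single-layer perceptron cannot represent, so it shows why the hidden layer is needed.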
Hypothesis Space
- The space of all n-dimensional vectors of real-valued weights,
  where n is the number of weights in the network.
Bias
- Inductive bias: smooth interpolation between data points
- Search bias: ??
- Language bias: ??
Hidden Layer Representation
- Useful features can be automatically discovered!
- Take a look at Figure 4.7 for a very simple example.
Overfitting
- Take a look at Figure 4.9 to understand this problem.
- One solution is to use a validation set.
- 10-fold (or more generally k-fold) cross validation is
a very common validation technique.
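The k-fold splitting behind this technique can be sketched in a few lines (a sketch only; the function name and shape are mine, and real uses typically shuffle the examples first):

```python
def k_fold_splits(n_examples, k):
    """Partition example indices into k folds. Each fold serves once as
    the validation set while the remaining k-1 folds form the training
    set, so every example is validated on exactly once."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Usage: iterate over the k train/validation splits
for train, val in k_fold_splits(10, 5):
    pass  # train network on `train`, measure error on `val`
```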