Back Propagation
Consider the following model:
 input     layer 1    layer 2   ...   layer l   ...   layer L

 v_1 ----> O --------> O ----------> O ----------> O ---> o_1
 v_2 ----> O --------> O ----------> O ----------> O ---> o_2
  .        .           .             .             .       .
  .        .           .             .             .       .
 v_N ----> O --------> O ----------> O ----------> O ---> o_M

 (connections between successive layers are complete;
  the crossing links are omitted for clarity)
where there are N inputs v_i, L layers, N_l nodes in each
layer l, M outputs o_i, and M target output values t_i.
Each node i in layer l has a linear activation function,
s_i^l = (A_i^l)' y^{l-1}, where A_i^l is a weight vector and
y^{l-1} is the vector of outputs from layer l-1 (for layer 1,
the inputs v), and a transfer function, y_i^l = f_i^l(s_i^l).
Note that the output of a node in a hidden layer is y rather than o.
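
To make the forward pass concrete, here is a minimal sketch in
Python. The sigmoid transfer function, the layer sizes, and the
random weights are illustrative assumptions; the model above leaves
f_l and the dimensions open.

    import numpy as np

    def sigmoid(s):
        # An assumed transfer function f_l(s); the model leaves f_l open.
        return 1.0 / (1.0 + np.exp(-s))

    def forward(v, weights):
        # weights[l] is a matrix whose row j holds the weight vector
        # A_j^l for node j in layer l; y^0 = v.
        activations, outputs = [], [v]
        y = v
        for A in weights:
            s = A @ y          # linear activation s_j^l = (A_j^l)' y^{l-1}
            y = sigmoid(s)     # transfer function y_j^l = f_l(s_j^l)
            activations.append(s)
            outputs.append(y)
        return activations, outputs

    # Example: N = 3 inputs, one hidden layer of 4 nodes, M = 2 outputs.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
    activations, outputs = forward(rng.normal(size=3), weights)
    o = outputs[-1]            # network outputs o_1 .. o_M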
Now define the following:
- a_{ji}^l is the weight in node j in layer l
for the connection from node i in layer l-1 to node j in layer l.
- f_l(s) is the same for all nodes in layer l.
- The error at an output node i is
E_i^L = t_i - o_i.
- The least-mean-squares error criterion, E = (1/2) sum_i (E_i^L)^2,
provides that the partial derivative of E with respect to o_i is
-E_i^L; the minus sign is absorbed into the weight update below.
- The partial derivative of f_l(s) with respect to s
is f'_l(s).
- The delta for a node is
d_i^l = E_i^l f'_l(s_i^l).
- The delta rule specifies that the change in weight a_{ji}^l is
Δa_{ji}^l = d_j^l y_i^{l-1}, as implemented in the sketch after
this list.
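
Combining these definitions gives a sketch of one training update,
continuing the Python example above. The learning rate eta is an
assumed constant, and the hidden-layer deltas use the backpropagated
error sum that the derivation below builds toward; only the
output-layer delta follows directly from the definitions so far.

    def sigmoid_prime(s):
        # f'_l(s) for the sigmoid assumed above.
        y = sigmoid(s)
        return y * (1.0 - y)

    def backprop_update(v, t, weights, eta=0.1):
        # One pass of the delta rule; t holds the targets t_1 .. t_M
        # and eta is an assumed learning-rate constant.
        activations, outputs = forward(v, weights)
        E = t - outputs[-1]                     # E_i^L = t_i - o_i
        d = E * sigmoid_prime(activations[-1])  # d_i^L = E_i^L f'_L(s_i^L)
        for l in range(len(weights) - 1, -1, -1):
            grad = np.outer(d, outputs[l])      # d_j^l y_i^{l-1}
            if l > 0:
                # Backpropagated error for the layer below (anticipating
                # the derivation that follows): d_j^{l-1} =
                # f'_{l-1}(s_j^{l-1}) * sum_k d_k^l a_{kj}^l
                d = (weights[l].T @ d) * sigmoid_prime(activations[l - 1])
            weights[l] += eta * grad            # change in a_{ji}^l
        return weights

Repeated calls to backprop_update(v, t, weights) move the outputs o
toward the targets t by gradient descent on the least-mean-squares
criterion.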
For an output node,