Backpropagation
David Wallace Croft
2005-02-18
Introduction
This is a derivation of a popular neural network training algorithm
known as backpropagation. It is commonly used with feed-forward multilayer
perceptrons.
To see backpropagation in action, check out the applet BackpropXOR in the
CroftSoft Collection.
Definitions and Identities
- Sigmoid Function: σ(x) = 1 / (1 + exp(-x))
- Derivative of the Sigmoid Function: dσ(x)/dx = σ(x) [1 - σ(x)]
Simplest Case
This is a derivation of the backpropagation neural network training
algorithm for the simplest case: a two-layer network with one neuron
in each layer. In this derivation, all of the variables are scalars.
Deriving the learning rule for the simplest case makes backpropagation
easier to understand when vectors and matrices are introduced later.
s → v → b → σ → h → w → a → σ → r
A signal (s) is the input to the first layer hidden neuron. It is
multiplied by the first layer weight (v). The weighted input (b) is then
squashed by the nonlinear neuron to fall between zero and one using the
sigmoidal function (σ). The output of the hidden neuron (h) is then
multiplied by the second layer weight (w). The weighted input from the hidden neuron (a)
is fed to the squashing function of the second layer output neuron.
The difference between the desired response (d)
and the actual response (r) of the output neuron is the error (e).
The objective function (l) is the scaled square of the error (e).
- Objective Function: l = (1/2) e²
- Error Function: e = d - r
- Response from Output Neuron: r = σ(a)
- Weighted Input from Hidden Neuron to Output Neuron: a = w h
- Hidden Neuron Output: h = σ(b)
- Weighted Input to Hidden Neuron: b = v s
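As a concrete illustration of this forward pass, here is a short Python sketch. It is not part of the original article, and the numeric values s = 1.0, v = 0.5, w = 0.5, and d = 1.0 are arbitrary choices for the example.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s, v, w, d = 1.0, 0.5, 0.5, 1.0   # arbitrary example values

b = v * s          # weighted input to the hidden neuron
h = sigmoid(b)     # hidden neuron output
a = w * h          # weighted input to the output neuron
r = sigmoid(a)     # response of the output neuron
e = d - r          # error
l = 0.5 * e * e    # objective function

print(b, h, a, r, e, l)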
Output Layer
We seek to minimize the objective function (l) by modifying the weight (w)
to the output layer. This is done by taking the partial derivative of the
objective function with respect to that weight.
This derivative is then used to modify the weight using a gradient descent
algorithm.
- Gradient for Output Layer Weight, Chain Rule: ∂l/∂w = (∂l/∂e) (∂e/∂r) (∂r/∂a) (∂a/∂w)
- Output Layer Weight Gradient, First Term: ∂l/∂e = e
- Output Layer Weight Gradient, Second Term: ∂e/∂r = -1
- Output Layer Weight Gradient, Third Term: ∂r/∂a = σ(a) [1 - σ(a)] = r (1 - r)
- Output Layer Weight Gradient, Fourth Term: ∂a/∂w = h
- Gradient for Output Layer Weight, Combined Terms: ∂l/∂w = -e r (1 - r) h
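The chain-rule terms above can be evaluated directly. Here is a rough Python sketch, again using the arbitrary example values from the forward-pass sketch rather than anything from the original article; it multiplies the four terms and confirms they reduce to the combined expression.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s, v, w, d = 1.0, 0.5, 0.5, 1.0   # arbitrary example values
b = v * s; h = sigmoid(b); a = w * h; r = sigmoid(a); e = d - r

dl_de = e                # first term
de_dr = -1.0             # second term
dr_da = r * (1.0 - r)    # third term, the sigmoid derivative
da_dw = h                # fourth term

dl_dw = dl_de * de_dr * dr_da * da_dw
print(dl_dw, -e * r * (1.0 - r) * h)   # the two expressions agree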
Hidden Layer
We also seek to minimize the objective function (l) by modifying the weight
(v) to the hidden layer. The derivations of the first three terms in the
gradient for the hidden layer weight are skipped, as they are the same as the
first three terms in the gradient for the output layer derived previously.
- Gradient for Hidden Layer Weight, Chain Rule: ∂l/∂v = (∂l/∂e) (∂e/∂r) (∂r/∂a) (∂a/∂h) (∂h/∂b) (∂b/∂v)
- Hidden Layer Weight Gradient, Fourth Term: ∂a/∂h = w
- Hidden Layer Weight Gradient, Fifth Term: ∂h/∂b = σ(b) [1 - σ(b)] = h (1 - h)
- Hidden Layer Weight Gradient, Sixth Term: ∂b/∂v = s
- Gradient for Hidden Layer Weight, Combined Terms: ∂l/∂v = -e r (1 - r) w h (1 - h) s
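One way to sanity-check the combined hidden-layer expression is to compare it against a numerical finite-difference estimate. This check is not part of the derivation itself, and the example values remain arbitrary; it simply shows the analytic gradient matching a small perturbation of the objective function.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def objective(v, w, s=1.0, d=1.0):
    h = sigmoid(v * s)           # forward pass
    r = sigmoid(w * h)
    return 0.5 * (d - r) ** 2    # squared-error objective

s, v, w, d = 1.0, 0.5, 0.5, 1.0
b = v * s; h = sigmoid(b); a = w * h; r = sigmoid(a); e = d - r

analytic = -e * r * (1.0 - r) * w * h * (1.0 - h) * s   # combined terms
eps = 1e-6
numeric = (objective(v + eps, w) - objective(v - eps, w)) / (2.0 * eps)
print(analytic, numeric)   # the two values should agree closely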
Local Gradient
We noted that the first three terms of the gradients for the output layer
and the hidden layer are the same. The negative product of these three
terms is the local gradient. The local gradient can be computed for
the output layer and then reused for the hidden layer.
- Local Gradient for Output Neuron: δ_a = -(∂l/∂e) (∂e/∂r) (∂r/∂a) = -∂l/∂a = e r (1 - r)
- Output Layer Gradient Abbreviated: ∂l/∂w = -δ_a h
- Hidden Layer Gradient Abbreviated: ∂l/∂v = -δ_a w h (1 - h) s
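In code, computing the local gradient once and reusing it for both weight gradients looks roughly like the sketch below. The example values are arbitrary, and delta_a is an assumed name for the output neuron's local gradient.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s, v, w, d = 1.0, 0.5, 0.5, 1.0   # arbitrary example values
b = v * s; h = sigmoid(b); a = w * h; r = sigmoid(a); e = d - r

delta_a = e * r * (1.0 - r)               # local gradient of the output neuron
dl_dw = -delta_a * h                      # output layer weight gradient
dl_dv = -delta_a * w * h * (1.0 - h) * s  # hidden layer weight gradient reuses delta_a
print(dl_dw, dl_dv)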
Backpropagation
The local gradient shows how the objective function decreases as the
weighted input to the neuron increases. The local gradient for an output
neuron depends on the error (e). The local gradients for hidden layer
neurons, however, depend on the local gradients of the following layer.
Thus the computation of the local gradients must be propagated from the back
layer of the network to the front. The weight gradients can then be
expressed as functions of the local gradients and the unweighted inputs.
- Local Gradient for Hidden Neuron: δ_b = -∂l/∂b = δ_a w h (1 - h)
- Local Gradient Backpropagation: δ_b = h (1 - h) w δ_a
- The Pattern for Hidden Layers: the local gradient of a hidden neuron is the derivative of its squashing function times the weight to the following neuron times the local gradient of the following neuron
- Hidden Layer Gradient Abbreviated Further: ∂l/∂v = -δ_b s
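A sketch of the propagation step itself, under the same assumptions as the earlier sketches (arbitrary values; delta_a and delta_b as assumed names for the output and hidden local gradients): the hidden neuron's local gradient is computed from the output neuron's, and each weight gradient is the negative of a local gradient times an unweighted input.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s, v, w, d = 1.0, 0.5, 0.5, 1.0   # arbitrary example values
b = v * s; h = sigmoid(b); a = w * h; r = sigmoid(a); e = d - r

delta_a = e * r * (1.0 - r)             # output neuron: depends on the error
delta_b = h * (1.0 - h) * w * delta_a   # hidden neuron: depends on the following layer's local gradient
dl_dw = -delta_a * h                    # gradient = -(local gradient) * (unweighted input)
dl_dv = -delta_b * s
print(dl_dw, dl_dv)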
Gradient Descent
Now that we have the gradient of the objective function with respect to the
weights in terms of the local gradients, we can incrementally decrease
the objective function, and therefore the overall error, by shifting the
weights.
- Weight Update Rule: w_new = w_old + Δw
- Descend the Gradient: Δw = -η ∂l/∂w, where η is the learning rate
- A Negative Becomes a Positive: Δw = -η (-δ_a h) = η δ_a h
- The Weight Update Rule Revised: w_new = w_old + η δ_a h
- Similar Pattern for Hidden Weights: v_new = v_old + η δ_b s
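Putting the pieces together, a minimal training loop for this two-neuron network might look like the sketch below. It is an illustration rather than the article's own code; the training pair, initial weights, learning rate (eta = 0.5), and iteration count are all arbitrary choices.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s, d = 1.0, 1.0    # arbitrary training pair: signal and desired response
v, w = 0.5, 0.5    # arbitrary initial weights
eta = 0.5          # arbitrary learning rate

for step in range(1000):
    h = sigmoid(v * s)                      # forward pass
    r = sigmoid(w * h)
    e = d - r
    delta_a = e * r * (1.0 - r)             # local gradients, computed back to front
    delta_b = h * (1.0 - h) * w * delta_a
    w += eta * delta_a * h                  # weight update rules
    v += eta * delta_b * s

print(e)   # the error shrinks toward zero as training proceeds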