Fengsu Tongyi (《風俗通義》), "Qiong Tong" (《窮通》, Adversity and Success)

The Book of Changes says: "Of the images suspended on high that shine forth brightly, none is greater than the sun and the moon." Yet at times they are darkened and obscured. The Book of Odes praises: "Surging flow the Jiang and the Han, the ordering threads of south and north." Yet at times they are blocked and stagnant. The Analects: "Truly Heaven has endowed him without stint; none is more flourishing than the sage." Yet at times he meets hardship and obstruction. The sun and moon do not lose their substance, and so, though veiled, they shine again; the Jiang and Han do not lose their sources, and so, though exhausted, they flow through again; the sage does not lose his virtue, and so, though cast aside, he rises again. It is not only the sage whom Heaven makes truly generous; whoever has constancy will surely arrive there too. Therefore the gentleman, in adversity, does not grieve; in toil and humiliation he is not careless; he delights in Heaven and knows his destiny, and harbors neither resentment nor blame. Hence these records of obstruction first and joy after are titled "Qiong Tong."
At the very close of this chapter, Mr. Michael Nielsen suddenly speaks of "the big picture." One wonders whether that "bright image of sun and moon" can be glimpsed in it:
Backpropagation: the big picture
As I’ve explained it, backpropagation presents two mysteries. First, what’s the algorithm really doing? We’ve developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It’s one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn’t mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I’ll address both these mysteries.
To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight $w^l_{jk}$ in the network. That change in weight will cause a change in the output activation from the corresponding neuron, which ripples through the later layers and finally causes a change in the cost. The change $\Delta C$ in the cost is related to the change $\Delta w^l_{jk}$ in the weight by

$$\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47}$$
This suggests that a possible approach to computing $\partial C / \partial w^l_{jk}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C / \partial w^l_{jk}$.
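The idea that a small change in one weight produces a proportional small change in the cost can be checked numerically. A minimal sketch; the 2-2-1 architecture, the weights, the input, and the target below are all illustrative, not from the text:

```python
import numpy as np

# A tiny 2-2-1 sigmoid network with quadratic cost.  Everything here
# (architecture, weights, input x, target y) is invented for illustration.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
x, y = np.array([0.5, -0.3]), np.array([0.7])

def cost(W1):
    a1 = sigmoid(W1 @ x + b1)            # hidden activations
    a2 = sigmoid(W2 @ a1 + b2)           # output activation
    return 0.5 * np.sum((a2 - y) ** 2)   # quadratic cost C

# Nudge one weight by ever smaller amounts: the ratio Delta C / Delta w
# settles down to a single number -- the partial derivative of the cost
# with respect to that weight, which backpropagation will compute for us.
for dw in (1e-2, 1e-4, 1e-6):
    W1p = W1.copy()
    W1p[0, 0] += dw
    print(dw, (cost(W1p) - cost(W1)) / dw)
```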
Let’s try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^l_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. This change is given by

$$\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48}$$
The change in activation $\Delta a^l_j$ will cause changes in all the activations in the next layer, i.e., the $(l+1)^{\rm th}$ layer. We’ll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$,

$$\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j. \tag{49}$$
Substituting in the expression from Equation (48), we get:

$$\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{50}$$
Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from $w^l_{jk}$ to $C$, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the resulting expression is

$$\Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{51}$$
that is, we’ve picked up a $\partial a / \partial a$ type term for each additional neuron we’ve passed through, as well as the $\partial C / \partial a^L_m$ term at the end. This represents the change in $C$ due to changes in the activations along this particular path through the network. Of course, there are many paths by which a change in $w^l_{jk}$ can propagate to affect the cost, and we’ve been considering just a single path. To compute the total change in $C$ it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,

$$\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{52}$$
where we’ve summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that

$$\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \tag{53}$$
Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We’re computing the rate of change of $C$ with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron’s activation with respect to the other neuron’s activation. The edge from the first weight to the first neuron has a rate factor $\partial a^l_j / \partial w^l_{jk}$. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change $\partial C / \partial w^l_{jk}$ is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:
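The sum-over-paths picture can be checked directly on a network small enough to enumerate the paths by hand. A sketch, assuming an illustrative 1-2-2-1 sigmoid architecture (one path per second-hidden-layer neuron); none of these names or numbers come from the text:

```python
import numpy as np

# Sum-over-paths check on a tiny 1-2-2-1 sigmoid network with
# quadratic cost C = (a3 - y)^2 / 2.  All values are illustrative.
def s(z):  return 1.0 / (1.0 + np.exp(-z))
def sp(z): return s(z) * (1.0 - s(z))   # derivative of the sigmoid

rng = np.random.default_rng(1)
w1 = rng.normal(size=2)       # input -> hidden layer 1 (2 neurons)
W2 = rng.normal(size=(2, 2))  # hidden 1 -> hidden 2
w3 = rng.normal(size=2)       # hidden 2 -> output
x, y = 0.8, 0.3

def forward(w1):
    z1 = w1 * x;   a1 = s(z1)
    z2 = W2 @ a1;  a2 = s(z2)
    z3 = w3 @ a2;  a3 = s(z3)
    return z1, a1, z2, a2, z3, a3

z1, a1, z2, a2, z3, a3 = forward(w1)

# Each edge carries a rate factor (one activation's partial derivative
# with respect to the previous one); a path contributes the product of
# its rate factors, and dC/dw is the sum over all paths.
dC_da3 = a3 - y
da1_dw = sp(z1[0]) * x                 # edge from the weight itself
total = 0.0
for q in range(2):                     # one path per hidden-2 neuron
    da2q_da1 = W2[q, 0] * sp(z2[q])
    da3_da2q = w3[q] * sp(z3)
    total += dC_da3 * da3_da2q * da2q_da1 * da1_dw

# Compare against a finite-difference estimate of dC/dw1[0].
eps = 1e-6
w1p = w1.copy(); w1p[0] += eps
Cp = 0.5 * (forward(w1p)[-1] - y) ** 2
C0 = 0.5 * (a3 - y) ** 2
print(total, (Cp - C0) / eps)
```

The two printed numbers agree, which is exactly the content of Equation (53) for this small case.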
Now, I’m not going to work through all this here. It’s messy and requires considerable care to work through all the details. If you’re up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.
What about the other mystery – how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier*
*There is one clever step required. In Equation (53) the intermediate variables are activations like $a^{l+1}_q$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don’t have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. – short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It’s just a lot of hard work simplifying the proof I’ve sketched in this section.
───
If someone were indeed to follow the map to find the steed, tracing out and closely examining the relevant functional relationships:
……
and had already grasped the essence of going "backward":
Next, we’ll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule,

$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\, \delta^{l+1}_k, \tag{42}$$
……
where in the last step we have interchanged the two factors on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first factor, $\partial z^{l+1}_k / \partial z^l_j$, note that

$$z^{l+1}_k = \sum_j w^{l+1}_{kj}\, a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj}\, \sigma(z^l_j) + b^{l+1}_k.$$
Differentiating, we obtain

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\, \sigma'(z^l_j).$$
Substituting back into (42) we obtain

$$\delta^l_j = \sum_k w^{l+1}_{kj}\, \delta^{l+1}_k\, \sigma'(z^l_j).$$
This is just (BP2) written in component form.
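The component form of (BP2) lends itself to a quick numerical check against finite differences. A sketch, assuming an illustrative layer of 3 neurons feeding a layer of 2, with a quadratic cost:

```python
import numpy as np

# Numerical check of (BP2): delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l).
# The sizes, weights, and target below are illustrative.
def s(z):  return 1.0 / (1.0 + np.exp(-z))
def sp(z): return s(z) * (1.0 - s(z))

rng = np.random.default_rng(2)
z1 = rng.normal(size=3)          # weighted inputs z^l (layer l)
W  = rng.normal(size=(2, 3))     # weights w^{l+1}
b  = rng.normal(size=2)
y  = rng.normal(size=2)          # target

def cost(z1):
    a1 = s(z1)
    z2 = W @ a1 + b
    a2 = s(z2)
    return 0.5 * np.sum((a2 - y) ** 2)

# delta^{l+1} for the quadratic cost (this is BP1), then apply BP2:
a1 = s(z1); z2 = W @ a1 + b; a2 = s(z2)
delta2 = (a2 - y) * sp(z2)
delta1 = (W.T @ delta2) * sp(z1)       # (BP2) in vectorized form

# The finite-difference estimate of dC/dz^l_j should match delta^l_j.
eps = 1e-6
fd = np.array([(cost(z1 + eps * np.eye(3)[j]) - cost(z1)) / eps
               for j in range(3)])
print(delta1, fd)
```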
───
perhaps all that would remain is to supplement it with a sketch of the derivation:
Derivation
Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. Assuming one output neuron, the squared error function is:

- $E = \tfrac{1}{2}(t - y)^2$,

where

- $E$ is the squared error,
- $t$ is the target output for a training sample, and
- $y$ is the actual output of the output neuron.
The factor of $\tfrac{1}{2}$ is included to cancel the exponent when differentiating. Later, the expression will be multiplied with an arbitrary learning rate, so that it doesn’t matter if a constant coefficient is introduced now.
For each neuron $j$, its output $o_j$ is defined as

- $o_j = \varphi(\mathrm{net}_j) = \varphi\left(\sum_{k=1}^{n} w_{kj}\, o_k\right)$.
The input $\mathrm{net}_j$ to a neuron is the weighted sum of outputs $o_k$ of previous neurons. If the neuron is in the first layer after the input layer, the $o_k$ of the input layer are simply the inputs $x_k$ to the network. The number of input units to the neuron is $n$. The variable $w_{kj}$ denotes the weight between neurons $k$ and $j$.
The activation function $\varphi$ is in general non-linear and differentiable. A commonly used activation function is the logistic function:

$$\varphi(z) = \frac{1}{1 + e^{-z}},$$

which has a convenient derivative:

$$\frac{d\varphi}{dz}(z) = \varphi(z)\,(1 - \varphi(z)).$$
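That derivative identity is easy to verify numerically against a central difference; a small sketch:

```python
import numpy as np

# Check that logistic'(z) = logistic(z) * (1 - logistic(z))
# by comparing against a central finite difference.
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
analytic = logistic(z) * (1.0 - logistic(z))
h = 1e-6
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # difference is tiny
```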
Finding the derivative of the error
Calculating the partial derivative of the error with respect to a weight $w_{ij}$ is done using the chain rule twice:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\, \frac{\partial o_j}{\partial \mathrm{net}_j}\, \frac{\partial \mathrm{net}_j}{\partial w_{ij}}.$$
In the last factor on the right-hand side, only one term in the sum $\mathrm{net}_j$ depends on $w_{ij}$, so that

- $\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left(\sum_{k=1}^{n} w_{kj}\, o_k\right) = o_i$.
If the neuron is in the first layer after the input layer, $o_i$ is just $x_i$.
The derivative of the output of neuron $j$ with respect to its input is simply the partial derivative of the activation function (assuming here that the logistic function is used):

$$\frac{\partial o_j}{\partial \mathrm{net}_j} = \frac{\partial}{\partial \mathrm{net}_j} \varphi(\mathrm{net}_j) = \varphi(\mathrm{net}_j)\,(1 - \varphi(\mathrm{net}_j)).$$
This is the reason why backpropagation requires the activation function to be differentiable.
The first term is straightforward to evaluate if the neuron is in the output layer, because then $o_j = y$ and

$$\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\, \frac{1}{2}(t - y)^2 = y - t.$$
However, if $j$ is in an arbitrary inner layer of the network, finding the derivative of $E$ with respect to $o_j$ is less obvious.
Considering $E$ as a function of the inputs of all neurons $L = \{u, v, \ldots, w\}$ receiving input from neuron $j$,

$$\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(\mathrm{net}_u, \mathrm{net}_v, \ldots, \mathrm{net}_w)}{\partial o_j},$$

and taking the total derivative with respect to $o_j$, a recursive expression for the derivative is obtained:

$$\frac{\partial E}{\partial o_j} = \sum_{\ell \in L} \left(\frac{\partial E}{\partial \mathrm{net}_\ell}\, \frac{\partial \mathrm{net}_\ell}{\partial o_j}\right) = \sum_{\ell \in L} \left(\frac{\partial E}{\partial o_\ell}\, \frac{\partial o_\ell}{\partial \mathrm{net}_\ell}\, w_{j\ell}\right).$$
Therefore, the derivative with respect to can be calculated if all the derivatives with respect to the outputs of the next layer – the one closer to the output neuron – are known.
Putting it all together:

$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$$

with

$$\delta_j = \frac{\partial E}{\partial o_j}\, \frac{\partial o_j}{\partial \mathrm{net}_j} = \begin{cases} (o_j - t_j)\, o_j\, (1 - o_j) & \text{if } j \text{ is an output neuron,} \\ \left(\sum_{\ell \in L} \delta_\ell\, w_{j\ell}\right) o_j\, (1 - o_j) & \text{if } j \text{ is an inner neuron.} \end{cases}$$
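The two cases for $\delta_j$ can be exercised on a tiny logistic network and checked against a finite-difference gradient. A sketch, assuming an illustrative 2-2-1 architecture with invented weights, input, and target:

```python
import numpy as np

# Exercise the two delta_j cases on a 2-2-1 logistic network.
# Names (Wh, wo, phi, ...) and all numbers are illustrative.
def phi(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x  = np.array([0.4, -0.9])     # network inputs
Wh = rng.normal(size=(2, 2))   # w_{ij}: input i -> hidden neuron j
wo = rng.normal(size=2)        # hidden j -> output neuron
t  = 0.6                       # target output

def error(Wh, wo):
    o_h = phi(x @ Wh)          # hidden outputs o_j
    y   = phi(o_h @ wo)        # output neuron
    return 0.5 * (t - y) ** 2, o_h, y

E, o_h, y = error(Wh, wo)

# Output neuron: delta = (o_j - t_j) o_j (1 - o_j)
delta_out = (y - t) * y * (1.0 - y)
# Inner neurons: delta_j = (sum_l delta_l w_{jl}) o_j (1 - o_j)
delta_h = (delta_out * wo) * o_h * (1.0 - o_h)

# dE/dw_{ij} = delta_j o_i; check one hidden weight numerically.
grad_analytic = delta_h[1] * x[0]      # weight from input 0 to hidden 1
eps = 1e-6
Whp = Wh.copy(); Whp[0, 1] += eps
grad_numeric = (error(Whp, wo)[0] - E) / eps
print(grad_analytic, grad_numeric)
```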
To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate, $\alpha$. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by $-1$:

$$\Delta w_{ij} = -\alpha\, \frac{\partial E}{\partial w_{ij}} = -\alpha\, \delta_j\, o_i.$$

The $-1$ is required in order to update in the direction of a minimum, not a maximum, of the error function.
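As a minimal arithmetic illustration of the update rule (all numbers invented):

```python
# One gradient-descent step: Delta w_{ij} = -alpha * delta_j * o_i.
# alpha, delta_j, and o_i are made-up values standing in for the
# learning rate and the quantities from a backward pass.
alpha = 0.5
delta_j, o_i = 0.12, 0.8
w_ij = 0.3
w_ij += -alpha * delta_j * o_i   # move against the gradient
print(w_ij)                      # 0.3 - 0.5 * 0.12 * 0.8 = 0.252
```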
For a single-layer network, this expression becomes the Delta Rule. To better understand how backpropagation works, here is an example to illustrate it: The Back Propagation Algorithm, page 20.
───
then surely it will be attained in the end??!!