W!o+'s 《小伶鼬工坊演義》: Neural Networks [backpropagation], Part VIII

Fengsu Tongyi (《風俗通義》), the chapter "Qiong Tong" (《窮通》, "Adversity and Success"):
The Book of Changes says: "Of the images suspended on high that shine brightly, none is greater than the sun and moon." Yet at times they are dimmed. The Book of Odes praises: "The surging Jiang and Han are the guiding threads of south and north." Yet at times they are blocked. The Analects says: "Truly Heaven has endowed him without limit; none is more eminent than the sage." Yet at times he meets with hardship. The sun and moon do not lose their substance, and so, though obscured, they shine again; the Jiang and Han do not lose their sources, and so, though blocked, they flow through again; the sage does not lose his virtue, and so, though cast aside, he rises again. Nor is such abundance granted only to the sage; those of constancy may truly attain it as well. Therefore the gentleman is not grieved by hardship and want, nor debased by toil and humiliation; delighting in Heaven and knowing his destiny, he harbors neither resentment nor blame. Hence this record of adversity first and gladness after is called "Qiong Tong".

 

At the close of this chapter, Mr. Michael Nielsen suddenly takes up "the big picture". One wonders whether this "bright image of the sun and moon" can indeed be glimpsed:

Backpropagation: the big picture

As I’ve explained it, backpropagation presents two mysteries. First, what’s the algorithm really doing? We’ve developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It’s one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn’t mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I’ll address both these mysteries.

To improve our intuition about what the algorithm is doing, let’s imagine that we’ve made a small change \Delta w^l_{jk} to some weight in the network, w^l_{jk}:

That change in weight will cause a change in the output activation from the corresponding neuron:

That, in turn, will cause a change in all the activations in the next layer:

Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function:

The change \Delta C in the cost is related to the change \Delta w^l_{jk} in the weight by the equation

\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (47)

This suggests that a possible approach to computing \frac{\partial C}{\partial w^l_{jk}} is to carefully track how a small change in w^l_{jk} propagates to cause a small change in C. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute \partial C / \partial w^l_{jk}.
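Before carrying this out symbolically, it is easy to check Equation (47) numerically. The following is a minimal sketch under my own assumptions (a toy 2-3-1 sigmoid network and a quadratic cost for one training example; none of this is code from the book): it perturbs one weight by \Delta w, measures the resulting \Delta C, and compares it with the product of a numerically estimated \partial C / \partial w^l_{jk} and \Delta w^l_{jk}.

import numpy as np

# A minimal sketch: perturb one weight by Delta w and compare the
# measured Delta C with (dC/dw) * Delta w, as in Eq. (47).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(3)   # hidden layer
W3, b3 = rng.standard_normal((1, 3)), rng.standard_normal(1)   # output layer
x, y = np.array([0.5, -0.3]), np.array([0.7])                  # one training example

def cost(W2_var):
    a2 = sigmoid(W2_var @ x + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    return 0.5 * np.sum((a3 - y) ** 2)

j, k = 1, 0                                   # the weight w^2_{jk} we perturb
dw = 1e-3                                     # the small change Delta w

W2_plus = W2.copy()
W2_plus[j, k] += dw
delta_C = cost(W2_plus) - cost(W2)            # the actual change in the cost

eps = 1e-7                                    # estimate dC/dw by finite differences
W2_eps = W2.copy()
W2_eps[j, k] += eps
dC_dw = (cost(W2_eps) - cost(W2)) / eps

print(delta_C, dC_dw * dw)

The two printed values agree up to terms of order (\Delta w^l_{jk})^2, which is exactly what the \approx in Equation (47) claims.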

Let’s try to carry this out. The change \Delta w^l_{jk} causes a small change \Delta a^l_{j} in the activation of the j^{\rm th} neuron in the l^{\rm th} layer. This change is given by

\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (48)
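For reference, using the standard component form of the relations listed further down in this post (z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j and a^l_j = \sigma(z^l_j)), this partial derivative can be written out explicitly as

\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j) \, a^{l-1}_k.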

The change in activation \Delta a^l_{j} will cause changes in all the activations in the next layer, i.e., the (l+1)^{\rm th} layer. We’ll concentrate on the way just a single one of those activations is affected, say a^{l+1}_q,

In fact, it’ll cause the following change:

\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j. \ \ \ \ (49)

Substituting in the expression from Equation (48), we get:

\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (50)

Of course, the change \Delta a^{l+1}_q will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from w^l_{jk} to C, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m then the resulting expression is

\Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \ \ \ \ (51)

that is, we’ve picked up a \partial a / \partial a type term for each additional neuron we’ve passed through, as well as the \partial C/\partial a^L_m term at the end. This represents the change in C due to changes in the activations along this particular path through the network. Of course, there are many paths by which a change in w^l_{jk} can propagate to affect the cost, and we’ve been considering just a single path. To compute the total change in C it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,

\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \ \ \ \ (52)

where we’ve summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that

\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \ \ \ \ (53)

Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We’re computing the rate of change of C with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron’s activation with respect to the other neuron’s activation. The edge from the first weight to the first neuron has a rate factor \partial a^{l}_j / \partial w^l_{jk}. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change \partial C / \partial w^l_{jk} is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:

What I’ve been providing up to now is a heuristic argument, a way of thinking about what’s going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53). That’s easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.

Now, I’m not going to work through all this here. It’s messy and requires considerable care to work through all the details. If you’re up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.
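If you would like to see the sum-over-paths picture in action without working through the full derivation, here is a small numerical sketch (my own toy construction: a [2, 2, 2, 1] sigmoid network with a quadratic cost for one training example; it is not code from the book). It enumerates every path from a chosen weight w^1_{jk} to the cost, multiplies the rate factors along each path, sums over paths as in Equation (52), and checks the result against a finite-difference estimate of \partial C / \partial w^1_{jk}.

import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
sizes = [2, 2, 2, 1]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x, y = np.array([0.4, -0.2]), np.array([0.6])

def forward(weights):
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

zs, acts = forward(Ws)
j, k = 0, 1                                   # the weight w^1_{jk} in the first layer

# Edge from the weight to its neuron: d a^1_j / d w^1_{jk} = sigma'(z^1_j) * x_k.
start = sigmoid_prime(zs[0][j]) * x[k]

# Sum over every path (j -> q -> m) of the product of rate factors, as in Eq. (52).
path_sum = 0.0
for q, m in product(range(sizes[2]), range(sizes[3])):
    rate = Ws[1][q, j] * sigmoid_prime(zs[1][q])      # d a^2_q / d a^1_j
    rate *= Ws[2][m, q] * sigmoid_prime(zs[2][m])     # d a^3_m / d a^2_q
    rate *= acts[-1][m] - y[m]                        # d C / d a^3_m for C = 1/2 sum (a - y)^2
    path_sum += start * rate

# Check against a finite-difference estimate of dC / dw^1_{jk}.
def cost(weights):
    _, activations = forward(weights)
    return 0.5 * np.sum((activations[-1] - y) ** 2)

eps = 1e-6
Ws_eps = [W.copy() for W in Ws]
Ws_eps[0][j, k] += eps
print(path_sum, (cost(Ws_eps) - cost(Ws)) / eps)

The two printed numbers agree to within the finite-difference error, which is just Equation (53) evaluated numerically for this tiny network.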

What about the other mystery – how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* – short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It’s just a lot of hard work simplifying the proof I’ve sketched in this section.

*There is one clever step required. In Equation (53) the intermediate variables are activations like a_q^{l+1}. The clever idea is to switch to using weighted inputs, like z^{l+1}_q, as the intermediate variables. If you don’t have this idea, and instead continue using the activations a^{l+1}_q, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter.

───

 

If one were indeed to follow this map in search of the steed, tracing out in detail the relevant relations and functions:

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ \ \ \ (29)

C = C({\vec a}^L)

{\vec a}^L = \sigma ({\vec z}^L)

\sigma(z) \equiv \frac{1}{1+e^{-z}}

{\vec z}^{\ l} \ \equiv \  W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l}

{\vec z}^{\ l+1} \ \equiv \  W^{\ l+1} \ {\vec a}^{\ l}  \ + \ {\vec b}^{\ l+1}

……
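These relations are straightforward to transcribe directly into code. A minimal sketch (the layer sizes 4, 3 and 2 below are arbitrary choices for illustration only):

import numpy as np

# Each line below mirrors one of the relations above.

def sigma(z):                                  # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
a_prev = rng.random(4)                         # \vec a^{l-1}
W_l, b_l = rng.standard_normal((3, 4)), rng.standard_normal(3)
W_next, b_next = rng.standard_normal((2, 3)), rng.standard_normal(2)

z_l = W_l @ a_prev + b_l                       # \vec z^l     = W^l     \vec a^{l-1} + \vec b^l
a_l = sigma(z_l)                               # \vec a^l     = sigma(\vec z^l)
z_next = W_next @ a_l + b_next                 # \vec z^{l+1} = W^{l+1} \vec a^l     + \vec b^{l+1}
a_next = sigma(z_next)                         # and so on, up to \vec a^L and C = C(\vec a^L)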

 

and has already thoroughly grasped the essence of going "backwards":

Next, we’ll prove (BP2), which gives an equation for the error \delta^l in terms of the error in the next layer, \delta^{l+1}. To do this, we want to rewrite \delta^l_j = \partial C / \partial z^l_j in terms of \delta^{l+1}_k = \partial C / \partial z^{l+1}_k. We can do this using the chain rule,

\delta^l_j = \frac{\partial C}{\partial z^l_j} \ \ \ \ (40)

= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \ \ \ \ (41)

= \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \ \ \ \ (42)

……

where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of \delta^{l+1}_k. To evaluate the first term on the last line, note that

z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \ \ \ \ (43)

Differentiating, we obtain

\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \ \ \ \ (44)

Substituting back into (42) we obtain

\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \ \ \ \ (45)

This is just (BP2) written in component form.
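A quick numerical check of Equation (45) (the 4-neuron and 3-neuron layer sizes and the toy values are my own choices): the component form gives the same numbers as the vectorized form of (BP2), \delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l).

import numpy as np

# Toy check that the component form (45) matches the vectorized (BP2).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(3)
W_next = rng.standard_normal((4, 3))          # w^{l+1}_{kj}: 4 neurons in layer l+1, 3 in layer l
z_l = rng.standard_normal(3)                  # weighted inputs z^l_j
delta_next = rng.standard_normal(4)           # delta^{l+1}_k, assumed already computed

# Component form, Eq. (45): delta^l_j = sum_k w^{l+1}_{kj} delta^{l+1}_k sigma'(z^l_j).
delta_l = np.array([
    sum(W_next[k, j] * delta_next[k] for k in range(4)) * sigmoid_prime(z_l[j])
    for j in range(3)
])

# Vectorized form: delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l).
delta_l_vec = (W_next.T @ delta_next) * sigmoid_prime(z_l)

print(np.allclose(delta_l, delta_l_vec))      # True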

───

 

then perhaps one need only supplement it with a sketch of the derivation:

Derivation

Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. Assuming one output neuron, the squared error function is:

E = \tfrac{1}{2}(t - y)^2,

where

E is the squared error,
t is the target output for a training sample, and
y is the actual output of the output neuron.

The factor of \textstyle\frac{1}{2} is included to cancel the exponent when differentiating. Later, the expression will be multiplied with an arbitrary learning rate, so that it doesn’t matter if a constant coefficient is introduced now.

For each neuron j, its output o_{j} is defined as

o_{j} = \varphi(\mbox{net}_{j}) = \varphi\left(\sum_{k=1}^{n}w_{kj}o_k\right).

The input \mbox{net}_{j} to a neuron is the weighted sum of outputs o_k of previous neurons. If the neuron is in the first layer after the input layer, the o_k of the input layer are simply the inputs x_k to the network. The number of input units to the neuron is n. The variable w_{ij} denotes the weight between neurons i and j.

The activation function \varphi is in general non-linear and differentiable. A commonly used activation function is the logistic function:

 \varphi(z) = \frac{1}{1+e^{-z}}

which has a nice derivative of:

 \frac {d \varphi}{d z}(z) = \varphi(z)(1-\varphi(z))
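This identity follows from differentiating the logistic function directly:

\frac{d \varphi}{d z}(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \left( 1 - \frac{1}{1+e^{-z}} \right) = \varphi(z)\,(1-\varphi(z)).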

Finding the derivative of the error

Calculating the partial derivative of the error with respect to a weight w_{ij} is done using the chain rule twice:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\mathrm{net_j}} \frac{\partial \mathrm{net_j}}{\partial w_{ij}}

In the last factor of the right-hand side of the above, only one term in the sum \mathrm{net_j} depends on w_{ij}, so that

\frac{\partial \mathrm{net_j}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n}w_{kj}o_k\right) = o_i.

If the neuron is in the first layer after the input layer, o_i is just x_i.

The derivative of the output of neuron j with respect to its input is simply the partial derivative of the activation function (assuming here that the logistic function is used):

\frac{\partial o_j}{\partial\mathrm{net_j}} = \frac {\partial}{\partial \mathrm{net_j}}\varphi(\mathrm{net_j}) = \varphi(\mathrm{net_j})(1-\varphi(\mathrm{net_j}))

This is the reason why backpropagation requires the activation function to be differentiable.

The first term is straightforward to evaluate if the neuron is in the output layer, because then o_j = y and

\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t

However, if j is in an arbitrary inner layer of the network, finding the derivative of E with respect to o_j is less obvious.

Considering E as a function of the inputs of all neurons L = {u, v, \dots, w} receiving input from neuron j,

\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(\mathrm{net}_u, \mathrm{net}_v, \dots, \mathrm{net}_w)}{\partial o_j}

and taking the total derivative with respect to o_j, a recursive expression for the derivative is obtained:

\frac{\partial E}{\partial o_j} = \sum_{l \in L} \left(\frac{\partial E}{\partial \mathrm{net}_l}\frac{\partial \mathrm{net}_l}{\partial o_j}\right) = \sum_{l \in L} \left(\frac{\partial E}{\partial o_{l}}\frac{\partial o_{l}}{\partial \mathrm{net}_l}w_{jl}\right)

Therefore, the derivative with respect to o_j can be calculated if all the derivatives with respect to the outputs o_l of the next layer – the one closer to the output neuron – are known.

Putting it all together:

\dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i}

with

\delta_{j} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\mathrm{net_j}} = \begin{cases} (o_{j}-t_{j})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an output neuron,}\\ (\sum_{l\in L} \delta_{l} w_{jl})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an inner neuron.} \end{cases}

To update the weight w_{ij} using gradient descent, one must choose a learning rate, \alpha. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by -1:

 \Delta w_{ij} = - \alpha \frac{\partial E}{\partial w_{ij}} = \begin{cases} - \alpha o_{i} (o_{j}-t_{j})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an output neuron,}\\ - \alpha o_{i} (\sum_{l\in L} \delta_{l} w_{jl})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an inner neuron.} \end{cases}

The \textstyle -1 is required in order to update in the direction of a minimum, not a maximum, of the error function.
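Putting the two cases and the update rule together, one gradient-descent step for a tiny 2-2-1 network with logistic activations might look like the following sketch (the layout, inputs, target and learning rate are arbitrary illustrative choices, not values from the text or from the linked example):

import numpy as np

# One backpropagation / gradient-descent step in the notation of this
# derivation: o_j = phi(net_j), delta_j as in the two cases above, and
# Delta w_{ij} = -alpha * delta_j * o_i.

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.5                                   # learning rate
x = np.array([0.5, -0.2])                     # inputs o_i to the first layer
t = 0.3                                       # target output
W_hid = np.array([[0.2, -0.4],                # w_{ij}: input i -> hidden neuron j
                  [0.7,  0.1]])
w_out = np.array([0.6, -0.3])                 # w_{jl}: hidden neuron j -> output neuron

# Forward pass.
o_hid = phi(x @ W_hid)                        # hidden outputs o_j
y = phi(o_hid @ w_out)                        # output of the single output neuron

# Output neuron: delta = (o_j - t_j) o_j (1 - o_j).
delta_out = (y - t) * y * (1 - y)

# Inner neurons: delta_j = (sum_{l in L} delta_l w_{jl}) o_j (1 - o_j);
# here L contains only the single output neuron.
delta_hid = (delta_out * w_out) * o_hid * (1 - o_hid)

# Weight updates Delta w_{ij} = -alpha * delta_j * o_i.
w_out += -alpha * delta_out * o_hid           # hidden -> output
W_hid += -alpha * np.outer(x, delta_hid)      # input  -> hidden

Repeating this forward-pass / delta / update cycle over the training examples is the gradient-descent procedure the derivation describes.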

For a single-layer network, this expression becomes the Delta Rule. To better understand how backpropagation works, here is an example to illustrate it: The Back Propagation Algorithm, page 20.

───

 

then surely, in the end, it must be attained??!!