W!o+'s 《小伶鼬工坊演義》: Neural Networks [Softmax], Part 2




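As the idiom goes, a single word is worth a thousand pieces of gold, and Michael Nielsen is indeed sparing with his words. The passage below is therefore easy to read but hard to fully grasp, so a few connecting annotations seem unavoidable.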

The learning slowdown problem: We’ve now built up considerable familiarity with softmax layers of neurons. But we haven’t yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let’s define the log-likelihood cost function. We’ll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

C \equiv -\ln a^L_y. \ \ \ \ (80)

So, for instance, if we’re training with MNIST images, and input an image of a 7, then the log-likelihood cost is -\ln a^L_7. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability a^L_7 which is close to 1, and so the cost -\ln a^L_7 will be small. By contrast, when the network isn’t doing such a good job, the probability a^L_7 will be smaller, and the cost -\ln a^L_7 will be larger. So the log-likelihood cost behaves as we’d expect a cost function to behave.


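The expression C \equiv -\ln a^L_y may get the meaning across, but as a function definition it is not quite formal. To be more rigorous, one can write the cost over the whole training set with an indicator function: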

C_T \equiv \frac{1}{n} \sum \limits_x \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

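Here 1_{y_{\alpha}(x) = 1} takes the value 1 if, for the training input x, the \alpha component of the corresponding desired output vector \vec y equals 1, and takes the value 0 otherwise; n is the total number of training samples. This also pins down the meaning of the cost function for a single training input:

C = C_x = \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

Although C_x looks like a sum of ten terms - \ln (a^L_{\alpha}), for any given training input x only one of them survives: the indicator vanishes for every \alpha except the correct class, so C_x reduces to -\ln a^L_y.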
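The equivalence between the indicator form of C_x and the shorthand -\ln a^L_y can be checked numerically. The sketch below is only an illustration: the weighted inputs `z`, the helper names `softmax` and `log_likelihood_cost`, and the choice of 7 as the correct class are all assumptions made for the example, not anything taken from Nielsen's code.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of weighted inputs z^L."""
    e = np.exp(z - np.max(z))  # shifting by the max leaves the result unchanged
    return e / e.sum()

def log_likelihood_cost(a, y_onehot):
    """C_x = sum over alpha of 1[y_alpha = 1] * (-ln a_alpha), for one sample."""
    indicator = (y_onehot == 1)
    return float(np.sum(np.where(indicator, -np.log(a), 0.0)))

# Hypothetical weighted inputs for the ten MNIST output neurons.
z = np.array([0.1, -1.2, 0.3, 0.0, 0.5, -0.7, 0.2, 2.5, 0.4, -0.3])
a = softmax(z)

# One-hot desired output for an image of a 7.
y = np.zeros(10)
y[7] = 1.0

cost_indicator = log_likelihood_cost(a, y)  # sum with the indicator function
cost_direct = float(-np.log(a[7]))          # the shorthand -ln a^L_7

print(cost_indicator, cost_direct)
```

The two printed numbers should agree up to floating-point rounding, since nine of the ten terms in the sum are multiplied by zero.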


What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities \partial C / \partial w^L_{jk} and \partial C / \partial b^L_j. I won’t go through the derivation explicitly – I’ll ask you to do in the problems, below – but with a little algebra you can show that*

*Note that I’m abusing notation here, using y in a slightly different way to last paragraph. In the last paragraph we used y to denote the desired output from the network – e.g., output a “7” if an image of a 7 was input. But in the equations which follow I’m using y to denote the vector of output activations which corresponds to 7, that is, a vector which is all 0s, except for a 1 in the 7th location.

\frac{\partial C}{\partial b^L_j} = a^L_j-y_j \ \ \ \ (81)
\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j-y_j) \ \ \ \ (82)



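Do these not follow directly from the Kronecker delta \delta_{ik}, the definition of the indicator 1_{y_{\alpha}(x) = 1}, and the chain rule of differentiation?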
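One way to fill in the algebra is sketched below. It assumes the standard softmax a^L_j = e^{z^L_j} / \sum_k e^{z^L_k} and the usual weighted input z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j, neither of which is restated in this passage, and it writes the indicator as the one-hot component y_{\alpha}, so that C = -\sum_{\alpha} y_{\alpha} \ln a^L_{\alpha} and \sum_{\alpha} y_{\alpha} = 1.

```latex
\begin{align*}
\frac{\partial a^L_i}{\partial z^L_j}
  &= a^L_i \left( \delta_{ij} - a^L_j \right), \\
\frac{\partial C}{\partial z^L_j}
  &= -\sum_{\alpha} y_{\alpha} \frac{1}{a^L_{\alpha}}
      \frac{\partial a^L_{\alpha}}{\partial z^L_j}
   = -\sum_{\alpha} y_{\alpha} \left( \delta_{\alpha j} - a^L_j \right)
   = a^L_j - y_j, \\
\frac{\partial C}{\partial b^L_j}
  &= \frac{\partial C}{\partial z^L_j}
     \frac{\partial z^L_j}{\partial b^L_j}
   = a^L_j - y_j, \\
\frac{\partial C}{\partial w^L_{jk}}
  &= \frac{\partial C}{\partial z^L_j}
     \frac{\partial z^L_j}{\partial w^L_{jk}}
   = a^{L-1}_k \left( a^L_j - y_j \right).
\end{align*}
```

The second line is where the Kronecker delta and the one-hot property do the work: the factor 1 / a^L_{\alpha} cancels the a^L_{\alpha} in the softmax derivative, and the remaining sum collapses to a^L_j - y_j.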

These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67). It’s the same equation, albeit in the latter I’ve averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we’ll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we’ll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That’s not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.


  • Derive Equations (81) and (82).
  • Where does the “softmax” name come from? Suppose we change the softmax function so the output activations are given by
    a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}, \ \ \ \ (83)

    where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c \rightarrow \infty. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a “softened” version of the maximum function. This is the origin of the term “softmax”.

  • Backpropagation with softmax and the log-likelihood cost: In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers. To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error \delta^L_j \equiv \partial C / \partial z^L_j in the final layer. Show that a suitable expression is:
    \delta^L_j = a^L_j -y_j. \ \ \ \ (84)

    Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.
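Equations (81), (82), and (84) can also be sanity-checked numerically before attempting the derivation. The sketch below compares the closed-form gradients with central finite differences on a tiny, entirely hypothetical output layer (3 inputs, 4 outputs, random weights); the sizes, the seed, and the names `cost`, `grad_b`, and `grad_W` are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def cost(W, b, a_prev, label):
    """Log-likelihood cost -ln a^L_label for one training input."""
    z = W @ a_prev + b
    return -np.log(softmax(z)[label])

rng = np.random.default_rng(0)
n_out, n_in = 4, 3                    # hypothetical layer sizes
W = rng.normal(size=(n_out, n_in))    # weights w^L_{jk}
b = rng.normal(size=n_out)            # biases b^L_j
a_prev = rng.uniform(size=n_in)       # activations a^{L-1}_k
label = 2                             # index of the correct class
y = np.zeros(n_out)
y[label] = 1.0

# Closed-form expressions from the text.
a = softmax(W @ a_prev + b)
delta = a - y                         # Equation (84)
grad_b = delta                        # Equation (81)
grad_W = np.outer(delta, a_prev)      # Equation (82)

# Central finite differences for comparison.
eps = 1e-6
num_grad_b = np.zeros_like(b)
for j in range(n_out):
    step = np.zeros_like(b)
    step[j] = eps
    num_grad_b[j] = (cost(W, b + step, a_prev, label)
                     - cost(W, b - step, a_prev, label)) / (2 * eps)

num_grad_W = np.zeros_like(W)
for j in range(n_out):
    for k in range(n_in):
        step = np.zeros_like(W)
        step[j, k] = eps
        num_grad_W[j, k] = (cost(W + step, b, a_prev, label)
                            - cost(W - step, b, a_prev, label)) / (2 * eps)

max_err_b = np.max(np.abs(num_grad_b - grad_b))
max_err_W = np.max(np.abs(num_grad_W - grad_W))
print(max_err_b, max_err_W)
```

Both printed errors should be tiny compared with the gradients themselves. A check like this does not replace the derivation, but it does catch sign and index mistakes quickly.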


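By now it should also be clear why one takes \delta^L_j = a^L_j -y_j. As for the limiting value of

a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}

as c \rightarrow \infty: suppose z^L_{\alpha} is the unique largest of the z^L_k appearing in \sum_k e^{c z^L_k}. Since

\lim \limits_{c \to \infty} \frac{e^{c z^L_j}}{e^{c z^L_{\alpha}}} = 1 \ \ \ if  \ j =\alpha

\lim \limits_{c \to \infty} \frac{e^{c z^L_j}}{e^{c z^L_{\alpha}}} = 0 \ \ \ if  \ j \neq \alpha

dividing the numerator and denominator of a^L_j by e^{c z^L_{\alpha}} shows that the denominator tends to 1, so in the limit the vector \vec {a}^{\ L} has its \alpha component equal to 1 and all other components equal to 0. (If several z^L_k tie for the maximum, they share the probability equally.)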
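The limit can be watched numerically. In this minimal sketch the function name `softmax_c` and the four sample values in `z` are made up for the example; the largest entry is deliberately only slightly larger than the runner-up, so the sharpening is visible as c grows.

```python
import numpy as np

def softmax_c(z, c=1.0):
    """Scaled softmax of Equation (83): a_j = exp(c z_j) / sum_k exp(c z_k)."""
    s = c * (z - np.max(z))  # shifting by the max avoids overflow for large c
    e = np.exp(s)
    return e / e.sum()

# Hypothetical weighted inputs; index 1 holds the maximum.
z = np.array([1.0, 2.0, 0.5, 1.9])

for c in (1.0, 10.0, 100.0, 1000.0):
    print(c, np.round(softmax_c(z, c), 4))

a_soft = softmax_c(z, 1.0)      # the ordinary softmax
a_large = softmax_c(z, 1000.0)  # close to the hard maximum
```

At c = 1 the probability mass is spread across all four outputs, with the two near-equal entries close together. As c grows the mass concentrates on the largest z^L_j, which is the sense in which the c = 1 case is a "softened" maximum.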

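As for what a “likelihood function” is: saying more would risk straying too far off topic, so allow me to dodge the question by pointing to the Wikipedia entry.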




