W!o+'s 《小伶鼬工坊演義》: Neural Networks [Softmax] II

《史記/卷085》 (Records of the Grand Historian, Volume 85): Lü Buwei

Lü Buwei was a great merchant of Yangzhai. Travelling about, buying cheap and selling dear, he amassed a fortune of a thousand gold.

……

呂不韋列… : At that time, Wei had the Lord of Xinling, Chu had the Lord of Chunshen, Zhao had the Lord of Pingyuan, and Qi had the Lord of Mengchang, all of whom humbled themselves before scholars and delighted in their retainers, each striving to outdo the others. Lü Buwei, given Qin's strength, was ashamed to fall short of them, so he too gathered scholars and treated them generously, until his retainers numbered three thousand. At that time the feudal states had many skilled debaters, men such as Xun Qing, whose writings spread throughout the realm. Lü Buwei therefore had each of his guests write down what he had learned, and compiled their discussions into the Eight Surveys, the Six Discourses, and the Twelve Almanacs, more than two hundred thousand words in all. Holding that it encompassed all matters of heaven and earth, the myriad things, and past and present, he named it the Lüshi Chunqiu. He displayed it at the market gate of Xianyang, hung a thousand gold above it, and invited the travelling scholars and guests of the feudal states: whoever could add or remove a single character would be given the thousand gold.

───

 

The idiom speaks of 'a single character worth a thousand gold' (一字千金), and Mr. Michael Nielsen is indeed just as sparing with his words; as a result this passage is easy to read but hard to see through!! There is nothing for it but to add some notes to thread it together??

The learning slowdown problem: We’ve now built up considerable familiarity with softmax layers of neurons. But we haven’t yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let’s define the log-likelihood cost function. We’ll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

C \equiv -\ln a^L_y. \ \ \ \ (80)

So, for instance, if we’re training with MNIST images, and input an image of a 7, then the log-likelihood cost is -\ln a^L_7. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability a^L_7 which is close to 1, and so the cost -\ln a^L_7 will be small. By contrast, when the network isn’t doing such a good job, the probability a^L_7 will be smaller, and the cost -\ln a^L_7 will be larger. So the log-likelihood cost behaves as we’d expect a cost function to behave.

 

The expression C \equiv -\ln a^L_y gets the idea across, but viewed strictly as a function definition it is perhaps not formal enough. To be more rigorous, one can use the 'indicator function':

Indicator function

In mathematics, an indicator function or a characteristic function is a function defined on a set X that indicates membership of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for all elements of X not in A. It is usually denoted by a symbol 1 or I, sometimes in boldface or blackboard boldface, with a subscript describing the set.

Definition

The indicator function of a subset A of a set X is a function

\mathbf{1}_A \colon X \to \{ 0,1 \} \,

defined as

\mathbf{1}_A(x) := \begin{cases} 1 &\text{if } x \in A, \\ 0 &\text{if } x \notin A. \end{cases}

The Iverson bracket allows the equivalent notation, [x\in A], to be used instead of \mathbf{1}_A(x).

The function \mathbf{1}_A is sometimes denoted I_A, \chi_A or even just A. (The Greek letter \chi appears because it is the initial letter of the Greek word characteristic.)

───

 

and write the total cost as

C_T \equiv \frac{1}{n} \sum \limits_x \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

Here 1_{y_{\alpha}(x) = 1} means: if the training 'input sample' is x and the \alpha component of the corresponding 'correct output' vector \vec y equals 1, the indicator takes the value 1; otherwise it takes the value 0. And n is the total number of samples. In the same way the meaning of the per-sample 'cost function'

C = C_x = \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

also becomes explicit. Although C_x looks like a sum of ten terms -\ln (a^L_j), in fact for any given 'input sample' x only one of the indicators is nonzero, so in the end only the single term -\ln (a^L_{\alpha}) survives.
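To make this concrete, here is a minimal Python sketch (the array values and names are illustrative, not taken from Nielsen's code): it computes C_x for one MNIST-style sample whose correct digit is 7, and shows that the one-hot indicator picks out exactly one term of the sum.

import numpy as np

# Hypothetical softmax activations a^L for a single input x (10 classes, summing to 1).
a_L = np.array([0.01, 0.02, 0.01, 0.02, 0.03, 0.01, 0.05, 0.80, 0.03, 0.02])

# One-hot desired output: the indicator 1_{y_alpha(x)=1}; here the digit is a 7.
y = np.zeros(10)
y[7] = 1.0

# C_x = sum_alpha 1_{y_alpha(x)=1} * (-ln a^L_alpha); only the alpha = 7 term contributes.
C_x = np.sum(y * (-np.log(a_L)))
print(C_x)   # equals -ln(a_L[7]) = -ln(0.80), about 0.223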

 

What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities \partial C / \partial w^L_{jk} and \partial C / \partial b^L_j. I won’t go through the derivation explicitly – I’ll ask you to do so in the problems, below – but with a little algebra you can show that*

*Note that I’m abusing notation here, using y in a slightly different way to last paragraph. In the last paragraph we used y to denote the desired output from the network – e.g., output a “7” if an image of a 7 was input. But in the equations which follow I’m using y to denote the vector of output activations which corresponds to 7, that is, a vector which is all 0s, except for a 1 in the 7th location.

\frac{\partial C}{\partial b^L_j} = a^L_j - y_j \ \ \ \ (81)
\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j - y_j) \ \ \ \ (82)

 

These equations can be obtained by examining

Softmax function

Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

 \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \dots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).

See Multinomial logit for a probability model which uses the softmax activation function.

───

 

the defining relations of \delta_{ik} and of 1_{y_{\alpha}(x) = 1} above and then applying the 'chain rule' of differentiation, can they not??
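To spell out that chain-rule step (a short sketch using only quantities already defined above): for a single sample write the cost as C = - \sum_{\alpha} y_{\alpha} \ln a^L_{\alpha} with \vec y one-hot, and use the softmax derivative \partial a^L_{\alpha} / \partial z^L_j = a^L_{\alpha} (\delta_{\alpha j} - a^L_j) quoted above. Then

\frac{\partial C}{\partial z^L_j} = \sum_{\alpha} \left( - \frac{y_{\alpha}}{a^L_{\alpha}} \right) a^L_{\alpha} \left( \delta_{\alpha j} - a^L_j \right) = - y_j + a^L_j \sum_{\alpha} y_{\alpha} = a^L_j - y_j .

Since z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j, it follows that \partial C / \partial b^L_j = a^L_j - y_j and \partial C / \partial w^L_{jk} = a^{L-1}_k (a^L_j - y_j), which are exactly (81) and (82).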

These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67). It’s the same equation, albeit in the latter I’ve averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we’ll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we’ll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That’s not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Problems

  • Derive Equations (81) and (82).
  • Where does the “softmax” name come from? Suppose we change the softmax function so the output activations are given by
    a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}, \ \ \ \ (83)

    where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c \rightarrow \infty. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a “softened” version of the maximum function. This is the origin of the term “softmax”.

  • Backpropagation with softmax and the log-likelihood cost In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers. To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error \delta^L_j \equiv \partial C / \partial z^L_j in the final layer. Show that a suitable expression is:
    \delta^L_j = a^L_j -y_j. \ \ \ \ (84)

    Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.
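Before doing the algebra, one can also check (84) numerically. Below is a rough finite-difference sketch (the helper names softmax and cost are illustrative, not part of Nielsen's network code): it compares a^L - y against numerical derivatives of the log-likelihood cost with respect to z^L.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

def cost(z, y):
    return -np.log(softmax(z)[np.argmax(y)])   # log-likelihood cost -ln a^L_y

z = np.array([0.5, -1.2, 2.0, 0.3])    # weighted inputs z^L of the final layer
y = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot desired output

analytic = softmax(z) - y              # the claimed delta^L_j = a^L_j - y_j
eps = 1e-6
numeric = np.array([(cost(z + eps * np.eye(4)[j], y) -
                     cost(z - eps * np.eye(4)[j], y)) / (2 * eps)
                    for j in range(4)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True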

 

One should then also see why \delta^L_j = a^L_j - y_j is the right expression to take!! As for the 'limiting value' of

a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}

as c \rightarrow \infty??? Suppose that, among the terms of \sum_k e^{c z^L_k}, the exponent c z^L_{\alpha} is the largest. Since

\lim \limits_{c \to \infty} \frac{e^{c z^L_j}}{e^{c z^L_{\alpha}}} = \begin{cases} 1 & \text{if } j = \alpha, \\ 0 & \text{if } j \neq \alpha, \end{cases}

we conclude that the vector \vec {a}^{\ L} tends to one whose \alpha component is 1 while all the others are 0!!!
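A quick numerical illustration of that limit (a sketch with arbitrarily chosen z values; softmax_c is a hypothetical helper, not library code):

import numpy as np

def softmax_c(z, c):
    e = np.exp(c * (z - np.max(z)))    # scaled and shifted for numerical stability
    return e / np.sum(e)

z = np.array([1.0, 2.5, 0.3, 2.4])     # here z^L_1 = 2.5 is the (unique) maximum
for c in (1, 10, 100):
    print(c, np.round(softmax_c(z, c), 3))
# As c grows the output sharpens toward the one-hot vector [0, 1, 0, 0]:
# softmax with large c behaves like argmax, hence a "softened" maximum.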

As for what a 'Likelihood function' actually is, saying too much here would stray far off topic, so let us take refuge in the Wikipedia entry!

Likelihood function (似然函數)

In mathematical statistics, a likelihood function is a function of the parameters of a statistical model that expresses the likelihood of those parameters. Likelihood functions play a major role in statistical inference, for instance in maximum likelihood estimation and in the Fisher information. 'Likelihood' is close in meaning to 'probability': both refer to the possibility of some event occurring. In statistics, however, the two are clearly distinguished: probability is used to predict the outcomes of future observations when the parameters are known, whereas likelihood is used to estimate the parameters characterizing a phenomenon once certain observed outcomes are known.

In this sense the likelihood function can be understood as a reversal of conditional probability. Given a parameter B, the probability that event A occurs is written:

P(A \mid B) = \frac{P(A , B)}{P(B)} \!

Using Bayes' theorem,

P(B \mid A) = \frac{P(A \mid B)\;P(B)}{P(A)} \!

we can therefore turn this around to express likelihood: knowing that event A has occurred, we use the likelihood function \mathbb{L}(B \mid A) to assess how plausible the parameter B is. Formally, a likelihood function is also a kind of conditional probability function, but the variable we focus on has changed:

b\mapsto P(A \mid B=b) \!

Note that the likelihood function is not required to be normalized, i.e. \sum_{b \in \mathcal{B}} P(A \mid B=b) = 1 need not hold. A likelihood function multiplied by a positive constant is still a likelihood function: for every \alpha > 0, one may take as a likelihood function

L(b \mid A) = \alpha \; P(A \mid B=b) \!

Example

Consider the experiment of tossing a coin. Ordinarily, if the coin is known to come up heads with probability p_H = 0.5 and tails with probability 0.5, we can work out the probability of every possible outcome of a series of tosses. For instance, the probability of getting heads on both of two tosses is 0.25. Expressed as a conditional probability, this is:

P(\mbox{HH} \mid p_H = 0.5) = 0.5^2 = 0.25

where H denotes heads.

In statistics, what we care about is the information that an observed series of tosses carries about the coin's probability of landing heads. We can set up a statistical model: assume that on each toss the coin comes up heads with probability p_H and tails with probability 1 - p_H. The conditional probability can then be rewritten as a likelihood function:

L(p_H = 0.5 \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = 0.5) =0.25

That is to say, for this likelihood function, when both observed tosses come up heads, the likelihood of p_H = 0.5 is 0.25 (this does not mean that, having observed two heads, the probability that p_H = 0.5 is 0.25).

[Figure: the likelihood function after two tosses both coming up heads]

If we consider p_H = 0.6 instead, the value of the likelihood function changes:

L(p_H = 0.6 \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = 0.6) =0.36

Note that the value of the likelihood function has become larger. This says that if the parameter p_H were 0.6, the probability of observing two heads in a row would be greater than under the assumption p_H = 0.5. In other words, taking p_H to be 0.6 is more persuasive, more 'reasonable', than taking it to be 0.5. In short, what matters about a likelihood function is not its particular value, but whether it grows or shrinks as the parameter varies. For a given likelihood function, if there is a parameter value at which it attains its maximum, then that value is the most 'reasonable' parameter value.

In this example the likelihood function is in fact:

L(p_H = \theta \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = \theta) = \theta^2, \quad \text{where } 0 \le p_H \le 1.

If we take p_H = 1, the likelihood attains its maximum value 1. That is to say, when two heads in a row are observed, the most reasonable assumption is that the coin comes up heads with probability 1.

Similarly, if what is observed is three tosses of the coin, the first two heads and the third tails, then the likelihood function becomes:

L(p_H = \theta \mid \mbox{HHT}) = P(\mbox{HHT}\mid p_H = \theta) = \theta^2(1 - \theta), \quad \text{where } T \text{ denotes tails and } 0 \le p_H \le 1.

This time the likelihood attains its maximum at p_H = \frac{2}{3}. That is to say, when the observed three tosses are two heads followed by a tail, estimating the coin's probability of heads as p_H = \frac{2}{3} is the most reasonable.

[Figure: the likelihood function after three tosses, the first two heads and the third tails]

───
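As a rough check of the maximum cited in that quoted example (a short sketch; setting dL/d\theta = 2\theta - 3\theta^2 = 0 also gives \theta = 2/3 directly):

import numpy as np

theta = np.linspace(0.0, 1.0, 10001)
L = theta**2 * (1 - theta)        # likelihood after observing H, H, T
print(theta[np.argmax(L)])        # about 0.6667, i.e. p_H = 2/3
print(L.max())                    # about 4/27, roughly 0.148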