W!o+'s 《小伶鼬工坊演義》: Neural Networks【Softmax】Part Three

《佳人》 (The Beautiful Lady), by Du Fu

絕代有佳人,幽居在空谷。
自云良家子,零落依草木。
關中昔喪敗,兄弟遭殺戮。
官高何足論?不得收骨肉。
世情惡衰歇,萬事隨轉燭。
夫婿輕薄兒,新人美如玉。
合昏尚知時,鴛鴦不獨宿。
但見新人笑,那聞舊人哭?
在山泉水清,出山泉水濁。
侍婢賣珠回,牽蘿補茅屋。
摘花不插髮,採柏動盈掬。
天寒翠袖薄,日暮倚修竹。

There is a lady of matchless beauty, living hidden in an empty valley.
She says she is a daughter of a good family, fallen now, left to the grass and trees.
When Guanzhong collapsed in the recent ruin, her brothers were slaughtered;
what did their high office count for? She could not even gather their bones.
The world's feeling turns against whatever declines; all things flicker with the shifting candle.
Her husband is a fickle fellow, and his new woman is lovely as jade.
Even the silk-tree knows its hour to fold; mandarin ducks do not roost alone.
He sees only the new one's laughter; how should he hear the old one's weeping?
In the mountains the spring water runs clear; once out of the mountains it runs muddy.
Her maid comes back from selling pearls; they pull creepers to patch the thatched hut.
She picks flowers but does not wear them in her hair; gathering cypress, she fills her hands.
The weather is cold, her kingfisher sleeves are thin; at dusk she leans on the tall bamboo.

 

Why must a matchless beauty have to "pull creepers to patch her thatched hut" (牽蘿補茅屋)? Consider this single character 補 ("to mend"):

《説文解字》:補,完衣也。从衣,甫聲。
(Shuowen Jiezi: 補 means to make a garment whole again; it takes 衣, "clothing", as its signific and 甫 as its phonetic.)

It says everything about the hardship of a chaotic age and the coldness of the world's feelings! A torn garment must be mended (補) to be whole as before; a leaking roof must be patched (補) to keep out wind and rain. Why, then, should "learning" also need a 補? Could it be that what once was

坤 ䷁ (Kun, The Receptive)

六二:直,方,大,不習無不利。
(Six in the second place: straight, square, great; even without practice (習), nothing is unfavourable.)

must now be patched (補) with

君子攸行,先迷失道,後順得常。
(When the superior man has somewhere to go, taking the lead he goes astray and loses the way; following, he finds his constant course.)

Since Mr. Michael Nielsen devoted an unexpected stroke of the pen to the "Softmax" interlude, making it plain that this is no mere walk-on part, the author can only, for the sake of "completeness", patch in (補) Stanford University's

Softmax Regression


Introduction

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y^{(i)} \in \{0,1\}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y^{(i)} \in \{1,\ldots,K\} where K is the number of classes.

Recall that in logistic regression, we had a training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} of m labeled examples, where the input features are x^{(i)} \in \Re^{n}. With logistic regression, we were in the binary classification setting, so the labels were y^{(i)} \in \{0,1\}. Our hypothesis took the form:

h_\theta(x) = \frac{1}{1+\exp(-\theta^\top x)},

and the model parameters \theta were trained to minimize the cost function

J(\theta) = -\left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
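As a concrete illustration (a minimal numpy sketch added here, not part of the Stanford text), the sigmoid hypothesis and this cost can be written as follows; the layout convention, with the m examples stored as the rows of an array X, is an assumption of the sketch:

import numpy as np

def sigmoid(z):
    """The logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Binary cross-entropy J(theta) for labels y in {0, 1}.

    theta : length-n parameter vector.
    X     : m-by-n array with one example x^{(i)} per row (assumed layout).
    y     : length-m array of 0/1 labels.
    """
    h = sigmoid(X @ theta)                                   # h_theta(x^{(i)}) for every example
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))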

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on K different values, rather than only two. Thus, in our training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}, we now have that y^{(i)} \in \{1, 2, \ldots, K\}. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have K=10 different classes.

Given a test input x, we want our hypothesis to estimate the probability P(y=k | x) for each value of k = 1, \ldots, K. I.e., we want to estimate the probability of the class label taking on each of the K different possible values. Thus, our hypothesis will output a K-dimensional vector (whose elements sum to 1) giving us our K estimated probabilities. Concretely, our hypothesis h_{\theta}(x) takes the form:

h_\theta(x) = \begin{bmatrix} P(y = 1 | x; \theta) \\ P(y = 2 | x; \theta) \\ \vdots \\ P(y = K | x; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) }} \begin{bmatrix} \exp(\theta^{(1)\top} x ) \\ \exp(\theta^{(2)\top} x ) \\ \vdots \\ \exp(\theta^{(K)\top} x ) \\ \end{bmatrix}

Here \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)} \in \Re^{n} are the parameters of our model. Notice that the term \frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) } } normalizes the distribution, so that it sums to one.

For convenience, we will also write \theta to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \theta as an n-by-K matrix obtained by concatenating \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)} into columns, so that

\theta = \left[\begin{array}{cccc}| & | & | & | \\ \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\ | & | & | & | \end{array}\right].
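To make the hypothesis concrete, here is a small numpy sketch (an illustration only, assuming the column-stacked n-by-K Theta just described; subtracting the maximum score is a standard numerical-stability device, not part of the formula itself):

import numpy as np

def softmax_hypothesis(Theta, x):
    """Return the K-dimensional vector h_theta(x) of estimated class probabilities.

    Theta : n-by-K array whose k-th column is theta^{(k)}.
    x     : length-n input vector.
    """
    scores = Theta.T @ x                      # theta^{(k)T} x for k = 1, ..., K
    scores = scores - np.max(scores)          # shift by a constant; leaves the ratios unchanged
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)    # normalized so the K entries sum to 1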

Cost Function

We now describe the cost function that we'll use for softmax regression. In the equation below, 1\{\cdot\} is the "indicator function", so that 1\{\hbox{a true statement}\}=1, and 1\{\hbox{a false statement}\}=0. For example, 1\{2+2=4\} evaluates to 1; whereas 1\{1+1=5\} evaluates to 0. Our cost function will be:

J(\theta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}\right]
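Written out in numpy, the cost might look like the sketch below (again only an illustration, not the tutorial's own code: labels are assumed to be integers 1, \ldots, K as in the text, and the indicator 1\{y^{(i)} = k\} is realized by selecting the log-probability of the true class):

import numpy as np

def softmax_cost(Theta, X, y):
    """J(Theta) = - sum_i log P(y^{(i)} | x^{(i)}; Theta).

    Theta : n-by-K parameter matrix (k-th column is theta^{(k)}).
    X     : m-by-n array with one example per row.
    y     : length-m array of labels in {1, ..., K}.
    """
    scores = X @ Theta                                       # m-by-K array of theta^{(k)T} x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability only
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    m = X.shape[0]
    # The indicator 1{y^{(i)} = k} keeps exactly the log-probability of the true class.
    return -np.sum(log_probs[np.arange(m), y - 1])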

Notice that this generalizes the logistic regression cost function, which could also have been written:

\begin{align}
J(\theta) &= - \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y^{(i)} = k\right\} \log P(y^{(i)} = k | x^{(i)} ; \theta) \right]
\end{align}

The softmax cost function is similar, except that we now sum over the K different possible values of the class label. Note also that in softmax regression, we have that

P(y^{(i)} = k | x^{(i)} ; \theta) = \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) }

.
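In fact one can check the K = 2 case directly: because the softmax probabilities are unchanged when the same vector is subtracted from every \theta^{(k)}, subtracting \theta^{(2)} gives

\begin{align}
P(y = 1 | x; \theta)
&= \frac{\exp(\theta^{(1)\top} x)}{\exp(\theta^{(1)\top} x) + \exp(\theta^{(2)\top} x)} \\
&= \frac{\exp\left((\theta^{(1)} - \theta^{(2)})^\top x\right)}{\exp\left((\theta^{(1)} - \theta^{(2)})^\top x\right) + 1}
= \frac{1}{1 + \exp\left(-(\theta^{(1)} - \theta^{(2)})^\top x\right)},
\end{align}

which is exactly the logistic hypothesis h_\theta(x) with \theta = \theta^{(1)} - \theta^{(2)}.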

We cannot solve for the minimum of J(\theta) analytically, and thus as usual we’ll resort to an iterative optimization algorithm. Taking derivatives, one can show that the gradient is:

\nabla_{\theta^{(k)}} J(\theta) = - \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = k\} - P(y^{(i)} = k | x^{(i)}; \theta) \right) \right] }

Recall the meaning of the "\nabla_{\theta^{(k)}}" notation. In particular, \nabla_{\theta^{(k)}} J(\theta) is itself a vector, so that its j-th element is \frac{\partial J(\theta)}{\partial \theta^{(k)}_j}, the partial derivative of J(\theta) with respect to the j-th element of \theta^{(k)}.

Armed with this formula for the derivative, one can then plug it into a standard optimization package and have it minimize J(\theta).
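A vectorized numpy version of this gradient, together with a plain gradient-descent update, might look like the following sketch (the same data-layout assumptions as the earlier snippets; it is an illustration, not the tutorial's code):

import numpy as np

def softmax_grad(Theta, X, y):
    """Gradient of J(Theta): an n-by-K array whose k-th column is nabla_{theta^{(k)}} J(Theta).

    X : m-by-n inputs (one example per row); y : labels in {1, ..., K}.
    """
    m, K = X.shape[0], Theta.shape[1]
    scores = X @ Theta
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)   # P(y^{(i)} = k | x^{(i)}; Theta)
    indicator = np.zeros((m, K))
    indicator[np.arange(m), y - 1] = 1.0               # 1{y^{(i)} = k}
    return -X.T @ (indicator - probs)                  # matches the formula above, column by column

# A plain gradient-descent update with a user-chosen step size alpha would be
#   Theta = Theta - alpha * softmax_grad(Theta, X, y)
# although in practice softmax_cost and softmax_grad are usually handed to an
# off-the-shelf optimizer (for example scipy.optimize.minimize).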

………

 

, in the hope that readers' eyes may reach a thousand li, and that those who use it may find in it a stone from another mountain with which to polish their own jade.