W!o+'s "Little Weasel Workshop Chronicles": Neural Networks [Xue Er, Yao Yue] II







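Using a “small story” to convey a “big idea” looks easy, but it is in fact very hard! So at this point Mr. Michael Nielsen takes an unexpected turn and devotes a great deal of space to the “zero-one learning” problem of one tiny “neuron”. It is a truly remarkable way to write!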

The cross-entropy cost function

Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn’t continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we’re decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.

Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let’s look at a toy example. The example involves a neuron with just one input:

We’ll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let’s take a look at how the neuron learns.

To make things definite, I’ll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning, I wasn’t picking them to be special in any way. The initial output from the neuron is 0.82, so quite a bit of learning will be needed before our neuron gets near the desired output, 0.0. Click on “Run” in the bottom right corner below to see how the neuron learns an output much closer to 0.0. Note that this isn’t a pre-recorded animation, your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is \eta = 0.15, which turns out to be slow enough that we can follow what’s happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, C, introduced back in Chapter 1. I’ll remind you of the exact form of the cost function shortly, so there’s no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on “Run” again.


As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. That’s not quite the desired output, 0.0, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let’s look at how the neuron learns to output 0 in this case. Click on “Run” again:


Although this example uses the same learning rate (\eta =0.15), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don’t change much at all. Then the learning kicks in and, much as in our first example, the neuron’s output rapidly moves closer to 0.0.
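The two runs described above can be reproduced offline with a minimal sketch. It assumes plain gradient descent on the single training pair x = 1, y = 0 with the quadratic cost and \eta = 0.15, as stated in the text. The function name train_single_neuron and the choice of 300 epochs are illustrative assumptions, not taken from the original animation.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def train_single_neuron(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on the quadratic cost C = (y - a)^2 / 2.

    Returns a list holding the neuron's output before each epoch,
    followed by the output after the last update.
    The default of 300 epochs is an assumption made for illustration.
    """
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        # sigma'(z) = a * (1 - a); both partial derivatives share
        # the common factor (a - y) * sigma'(z).
        delta = (a - y) * a * (1.0 - a)
        w -= eta * delta * x
        b -= eta * delta
    outputs.append(sigmoid(w * x + b))
    return outputs


good_start = train_single_neuron(w=0.6, b=0.9)
bad_start = train_single_neuron(w=2.0, b=2.0)

for epoch in (0, 50, 150, 300):
    print(f"epoch {epoch:3d}: output {good_start[epoch]:.3f} (w=0.6, b=0.9), "
          f"{bad_start[epoch]:.3f} (w=2.0, b=2.0)")
```

Printing the output at a few epochs side by side makes the contrast visible: the run that starts at w = b = 2.0 stays near its initial output for many epochs before it starts to fall.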

This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we’re badly wrong about something. But we’ve just seen that our artificial neuron has a lot of difficulty learning when it’s badly wrong – far more difficulty than when it’s just a little wrong. What’s more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?

To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, \partial C/\partial w and \partial C / \partial b. So saying “learning is slow” is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let’s compute the partial derivatives. Recall that we’re using the quadratic cost function, which, from Equation (6), is given by

C = \frac{(y-a)^2}{2}, \ \ \ \ (54)

where a is the neuron’s output when the training input x = 1 is used, and y = 0 is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that a = \sigma(z), where z = wx+b. Using the chain rule to differentiate with respect to the weight and bias we get

\frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z) \ \ \ \ (55)
\frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z), \ \ \ \ (56)

where I have substituted x = 1 and y = 0. To understand the behaviour of these expressions, let’s look more closely at the \sigma'(z) term on the right-hand side. Recall the shape of the \sigma function:


We can see from this graph that when the neuron’s output is close to 1, the curve gets very flat, and so \sigma'(z) gets very small. Equations (55) and (56) then tell us that \partial C/\partial w and \partial C / \partial b get very small. This is the origin of the learning slowdown. What’s more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we’ve been playing with.
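As a rough check on Equations (55) and (56), the sketch below evaluates a \sigma'(z) at the two starting points used earlier and compares it with a central finite-difference estimate of the same derivatives. The helper names (quadratic_cost, analytic_gradient, numeric_gradient) and the step size h are assumptions introduced here for illustration.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_cost(w, b, x=1.0, y=0.0):
    """C = (y - a)^2 / 2 for the single training pair (x, y)."""
    a = sigmoid(w * x + b)
    return 0.5 * (y - a) ** 2


def analytic_gradient(w, b, x=1.0, y=0.0):
    """(dC/dw, dC/db) from Equations (55) and (56)."""
    a = sigmoid(w * x + b)
    sigma_prime = a * (1.0 - a)
    return (a - y) * sigma_prime * x, (a - y) * sigma_prime


def numeric_gradient(w, b, h=1e-6):
    """Central finite-difference estimate of (dC/dw, dC/db)."""
    dw = (quadratic_cost(w + h, b) - quadratic_cost(w - h, b)) / (2 * h)
    db = (quadratic_cost(w, b + h) - quadratic_cost(w, b - h)) / (2 * h)
    return dw, db


for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    dw, db = analytic_gradient(w, b)
    ndw, ndb = numeric_gradient(w, b)
    print(f"w={w}, b={b}: dC/dw={dw:.5f} (numeric {ndw:.5f}), "
          f"dC/db={db:.5f} (numeric {ndb:.5f})")
```

The gradient at w = b = 2.0 comes out several times smaller than at w = 0.6, b = 0.9, even though the output there is further from the target. That is the slowdown the paragraph above attributes to the flat part of the \sigma curve.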



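As the saying goes, “Greatness shows in the ordinary, and true skill in the fine details.” That is the gift of those who can explain deep things in simple terms, and it is why they can so casually build a real argument out of an “ordinary”, trivial example!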

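Presumably the “Sigmoid” was chosen because it has all the needed properties and is tied in countless ways to “perceptrons”. Is that why it stands out across all kinds of writing on “neural networks”?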

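Yet people often want “the horse to run well” and also “the horse to eat no grass”, so the “high stability” of the “S neuron” turns into “slow change” instead. But without the “flatness at both ends” of that “S curve”, would “learning” really be better off? Not to mention that people also want to rule out the case where a tiny “difference between samples” flips the output of the “S neuron” from “correct to wrong”.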



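Threshold voltage [1], also called turn-on voltage [2], usually refers to the input voltage at the midpoint of the transition region of the transfer characteristic curve (the plot of output voltage against input voltage) of a TTL gate or a MOSFET.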









>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> N = 100
>>> I = np.arange(-10,10, 1.0/N)
>>> Sigmoid = 1 / ( 1 + np.exp(-1 * I))
>>> plt.subplot(3,1,1)
<matplotlib.axes.AxesSubplot object at 0x350cd10>
>>> plt.plot(I, Sigmoid, 'k-')
[<matplotlib.lines.Line2D object at 0x3169a90>]
>>> plt.xlabel('z')
<matplotlib.text.Text object at 0x351db50>
>>> plt.ylabel('sigmoid output')
<matplotlib.text.Text object at 0x3521ad0>
>>> plt.subplot(3,1,2)
<matplotlib.axes.AxesSubplot object at 0x353f950>
>>> SigmoidPrime = Sigmoid * (1 - Sigmoid)
>>> plt.plot(I, SigmoidPrime, 'r-')
[<matplotlib.lines.Line2D object at 0x353fd10>]
>>> plt.xlabel('z')
<matplotlib.text.Text object at 0x3541bd0>
>>> plt.ylabel("sigmoid' output")
<matplotlib.text.Text object at 0x3546b50>
>>> plt.subplot(3,1,3)
<matplotlib.axes.AxesSubplot object at 0x38f69d0>
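>>> # With x = 1 and y = 0, Equations (55)-(56) give dC/dw = dC/db = a * sigmoid'(z);
>>> # this product is the gradient of the cost, not the cost itself.
>>> CostGradient = Sigmoid * SigmoidPrime
>>> plt.plot(I, CostGradient, 'b-')
[<matplotlib.lines.Line2D object at 0x38f6d90>]
>>> plt.xlabel('z')
<matplotlib.text.Text object at 0x38fac50>
>>> plt.ylabel("dC/dw = a * sigmoid'(z)")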
<matplotlib.text.Text object at 0x38fdbd0>
>>> plt.show()

