W!o+'s 《小伶鼬工坊演義》: Neural Networks 【學而堯曰】 II

The Old Blacksmith's Life-Saving Chain

Posted by: 淺陌安然

There was an old blacksmith whose chains were sturdier than anyone else's, but he was a poor talker and sold very few of them. People said he was too honest; he paid no mind and went on forging every chain as solidly as he could. One of his chains was fitted to a great ocean liner as its main anchor chain, though it had never once been used. Then one night a storm blew up at sea, threatening at any moment to drive the ship onto the rocks. Every anchor chain aboard was dropped into the water, and one after another they snapped; only the old blacksmith's chain still held the ship fast at the crest of the wind and waves. Had even one of its countless links given way, the more than 1,000 passengers and all the cargo would have been swallowed by death! After a whole night's test of wind and storm, the old blacksmith's chain still gripped the rock on the sea floor. When dawn came and the sea grew calm, everyone aboard wept with emotion and broke into cheers…

Comment: Success grows out of an exacting pursuit of perfection and the accumulation of small things; failure grows out of a series of tiny errors piling up. There is no shortcut to success, only constant refinement; excellence comes from rigor, and rests on getting the details right.

Using a 'little story' to preach a 'big truth' may look easy, but it is in fact extremely hard to do! So here Mr. Michael Nielsen takes a surprising turn of the pen and writes at great length about the 'zero-one learning' problem of a single tiny 'neuron'?? Truly an inconceivable stroke of writing!!

The cross-entropy cost function

Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn’t continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we’re decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.

Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let’s look at a toy example. The example involves a neuron with just one input:

We’ll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let’s take a look at how the neuron learns.

To make things definite, I’ll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning; I wasn’t picking them to be special in any way. The initial output from the neuron is 0.82, so quite a bit of learning will be needed before our neuron gets near the desired output, 0.0. Click on “Run” in the bottom right corner below to see how the neuron learns an output much closer to 0.0. Note that this isn’t a pre-recorded animation; your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is \eta = 0.15, which turns out to be slow enough that we can follow what’s happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, C, introduced back in Chapter 1. I’ll remind you of the exact form of the cost function shortly, so there’s no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on “Run” again.
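Since the book's interactive demo cannot run here, the following is a minimal sketch of the same experiment, assuming a run of 300 epochs (my guess at the animation's length) and the plain gradient-descent update described in the text: one neuron with input x = 1, desired output y = 0, quadratic cost, learning rate η = 0.15, starting from w = 0.6 and b = 0.9.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Plain gradient descent on the quadratic cost C = (y - a)^2 / 2."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # dC/dw = (a - y) * sigma'(z) * x and dC/db = (a - y) * sigma'(z),
        # with sigma'(z) = a * (1 - a).
        delta = (a - y) * a * (1 - a)
        w -= eta * delta * x
        b -= eta * delta
    return w, b, sigmoid(w * x + b)

w, b, output = train(0.6, 0.9)   # uses the default 300 epochs, an assumed run length
print(output)                    # starts at about 0.82 and ends near 0.1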

[Figure Sigmoid-1: the training run starting from w = 0.6, b = 0.9]

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. That’s not quite the desired output, 0.0, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let’s look at how the neuron learns to output 0 in this case. Click on “Run” again:

[Figure Sigmoid-2: the training run starting from w = 2.0, b = 2.0]

Although this example uses the same learning rate (\eta =0.15), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don’t change much at all. Then the learning kicks in and, much as in our first example, the neuron’s output rapidly moves closer to 0.0.
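The same slowdown can be seen numerically. Under the same assumptions as the sketch above (x = 1, y = 0, η = 0.15, an assumed 300 epochs), this snippet prints the neuron's output every 50 epochs for both starting points; from (2.0, 2.0) the output changes very little at first and only begins to fall much later.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_trace(w, b, eta=0.15, epochs=300):
    """Record the neuron's output every 50 epochs while training on x = 1, y = 0."""
    trace = []
    for epoch in range(epochs):
        a = sigmoid(w + b)            # z = w*x + b with x = 1
        if epoch % 50 == 0:
            trace.append(round(a, 3))
        delta = a * a * (1 - a)       # (a - y) * sigma'(z) with y = 0
        w -= eta * delta
        b -= eta * delta
    return trace

print(output_trace(0.6, 0.9))   # drops quickly from about 0.82
print(output_trace(2.0, 2.0))   # barely moves for the first 150 or so epochs, then falls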

This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we’re badly wrong about something. But we’ve just seen that our artificial neuron has a lot of difficulty learning when it’s badly wrong – far more difficulty than when it’s just a little wrong. What’s more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?

To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, \partial C/\partial w and \partial C / \partial b. So saying “learning is slow” is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let’s compute the partial derivatives. Recall that we’re using the quadratic cost function, which, from Equation (6), is given by

C = \frac{(y-a)^2}{2}, \ \ \ \ (54)

where a is the neuron’s output when the training input x = 1 is used, and y = 0 is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that a = \sigma(z), where z = wx+b. Using the chain rule to differentiate with respect to the weight and bias we get

\frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z) \ \ \ \ (55)
\frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z), \ \ \ \ (56)

where I have substituted x = 1 and y = 0. To understand the behaviour of these expressions, let’s look more closely at the \sigma'(z) term on the right-hand side. Recall the shape of the \sigma function:

[Figure Sigmoid-3: the shape of the sigmoid function \sigma(z)]

We can see from this graph that when the neuron’s output is close to 1, the curve gets very flat, and so \sigma'(z) gets very small. Equations (55) and (56) then tell us that \partial C/\partial w and \partial C / \partial b get very small. This is the origin of the learning slowdown. What’s more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we’ve been playing with.
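Plugging the two starting points of the earlier examples into Equations (55) and (56) makes this concrete (a small numerical check of my own, not from the book): at z = 1.5 (w = 0.6, b = 0.9) the gradient a \sigma'(z) is roughly seven times larger than at z = 4.0 (w = b = 2.0), which is why the badly-wrong neuron learns so much more slowly at first.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# z = w*x + b with x = 1 at the two starting points:
# (w, b) = (0.6, 0.9) gives z = 1.5;  (w, b) = (2.0, 2.0) gives z = 4.0.
for z in (1.5, 4.0):
    a = sigmoid(z)
    grad = a * a * (1 - a)   # dC/dw = dC/db = a * sigma'(z), Equations (55)-(56)
    print(z, round(a, 3), round(grad, 4))

# Prints approximately:
# 1.5 0.818 0.1219
# 4.0 0.982 0.0173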

───

 

As the saying goes, 'greatness shows in the ordinary, and real craft lies in the fine details.' That is the talent of those who can explain the profound in simple terms, and it is why such an author can casually toss off what looks like a 'trivial' discussion!!??

Presumably the 'sigmoid' was chosen, and stands out across all kinds of writing on 'neural networks', because it has all the desired characteristics and is tied to 'perceptrons' by a thousand threads??!!

Yet people often want the horse to run well while also wanting it not to eat grass, and so the 'high stability' of the 'sigmoid neuron' turns into its 'slowness to change'??? But were it not for the 'flatness at both ends' of that S-curve, would it really serve 'learning' well!!! Not to mention that people also want to rule out the case where a tiny 'difference in the samples' flips the sigmoid neuron's output from 'right to wrong'!!??

For example, just consider the

Threshold voltage

The threshold voltage[1], also called the 閾電壓[2] or turn-on voltage, usually refers to the value of the input voltage at the midpoint of the transition region in the transfer characteristic curve (the plot of output voltage against input voltage) of a TTL gate or a MOSFET.

When the device switches from depletion to inversion, it passes through a state in which the electron concentration at the Si surface equals the hole concentration. At that point the device is on the verge of conduction, and the gate voltage there is defined as the threshold voltage, one of the important parameters of a MOSFET.

[Figure Threshold_formation_nowatermark: computer simulation of the formation of the inversion channel (the change in electron density) in a nanowire MOSFET; the threshold voltage is around 0.45 V]

───

 

How, exactly, is its 'high or low', its 'good or bad', to be decided??!!

Moreover, when it comes to 'neural networks', how are people going to argue for the importance of 'forgetting' to 'learning'???

Finally, just one figure is added, in the hope that it makes everything clear at a glance!!!

import numpy as np
import matplotlib.pyplot as plt

# Sample z over a wide range.
N = 100
I = np.arange(-10, 10, 1.0 / N)

# Top panel: the sigmoid itself, sigma(z).
Sigmoid = 1 / (1 + np.exp(-1 * I))
plt.subplot(3, 1, 1)
plt.plot(I, Sigmoid, 'k-')
plt.xlabel('z')
plt.ylabel('sigmoid output')

# Middle panel: its derivative, sigma'(z) = sigma(z) * (1 - sigma(z)),
# which flattens out at both ends of the S-curve.
SigmoidPrime = Sigmoid * (1 - Sigmoid)
plt.subplot(3, 1, 2)
plt.plot(I, SigmoidPrime, 'r-')
plt.xlabel('z')
plt.ylabel("sigmoid' output")

# Bottom panel: a * sigma'(z), the gradient of the quadratic cost from
# Equations (55) and (56) with x = 1, y = 0; it is labelled "Cost" here
# only to match the original session.
Cost = Sigmoid * SigmoidPrime
plt.subplot(3, 1, 3)
plt.plot(I, Cost, 'b-')
plt.xlabel('z')
plt.ylabel("Cost output")

plt.show()

 

[Figure Sigmoid-4: the three panels plotted by the code above: \sigma(z), \sigma'(z), and the "Cost" curve \sigma(z)\sigma'(z)]