W!o+'s 《小伶鼬工坊演義》: Neural Networks [Sigmoid] Part Five

Mr. Michael Nielsen devotes a long passage to explaining

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we’ll need is a data set to learn from – a so-called training data set. We’ll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST’s name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States’ National Institute of Standards and Technology. Here’s a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we’ll ask it to recognize images which aren’t in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We’ll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn’t see during training.

We’ll use the notation x to denote a training input. It’ll be convenient to regard each training input x as a 28 \times 28 = 784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We’ll denote the corresponding desired output by y = y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
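As a concrete illustration of this notation, the following sketch (all names, such as `one_hot` and `x_vec`, are my own) builds the 784-dimensional input vector and the desired 10-dimensional output y(x) for a training image depicting a 6:

```python
import numpy as np

# Hypothetical illustration of the notation above: a 28x28 greyscale
# image x flattened into a 784-dimensional column vector, and the
# desired output y(x) for an image depicting a 6 as a 10-dimensional
# one-hot column vector.
x = np.zeros((28, 28))                # stand-in for a scanned MNIST image
x_vec = x.reshape(784, 1)             # 28 * 28 = 784-dimensional vector

def one_hot(digit):
    """Desired network output y(x): all zeros except a 1 at `digit`."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

y6 = one_hot(6)                       # (0,0,0,0,0,0,1,0,0,0)^T as a column
```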

What we’d like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we’re achieving this goal we define a cost function*:

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it’s often used in research papers and other discussions of neural networks.

C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 .


Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven’t explicitly indicated this dependence. The notation \| v \| just denotes the usual length function for a vector v. We’ll call C the quadratic cost function; it’s also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that C(w,b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w,b) becomes small, i.e., C(w,b) \approx 0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w,b) \approx 0. By contrast, it’s not doing so well when C(w,b) is large – that would mean that y(x) is not close to the output a for a large number of inputs. So the aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We’ll do that using an algorithm known as gradient descent.
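The quadratic cost above can be sketched in a few lines; the function name `quadratic_cost` and the toy arrays are my own, assuming the outputs a and targets y(x) are supplied as arrays of 10-dimensional vectors:

```python
import numpy as np

# A minimal sketch (names my own) of the quadratic cost
#     C(w, b) = (1/2n) * sum_x ||y(x) - a||^2,
# with the network outputs and targets given as (n, 10) arrays.
def quadratic_cost(outputs, targets):
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, targets)) / (2 * n)

targets = np.eye(10)[:3]                       # three one-hot target vectors
cost_zero = quadratic_cost(targets, targets)   # perfect predictions
cost_pos = quadratic_cost(np.zeros((3, 10)), targets)  # all-zero outputs
```

Perfect predictions give C = 0, while every wrong output contributes a positive term, matching the non-negativity argument above.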

……

Indeed, there’s even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let’s suppose that we’re trying to make a move \Delta v in position so as to decrease C as much as possible. This is equivalent to minimizing \Delta C \approx \nabla C \cdot \Delta v. We’ll constrain the size of the move so that \| \Delta v \| = \epsilon for some small fixed \epsilon >0. In other words, we want a move that is a small step of a fixed size, and we’re trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of \Delta v which minimizes \nabla C \cdot \Delta v is \Delta v = - \eta \nabla C, where \eta = \epsilon / \|\nabla C\| is determined by the size constraint \|\Delta v\| = \epsilon . So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.
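The claim that \Delta v = -\eta \nabla C with \eta = \epsilon / \|\nabla C\| is the best move of length \epsilon can be checked numerically; this is a minimal sketch with made-up names, comparing the steepest-descent step against random moves of the same length:

```python
import numpy as np

# Numerical check (all names my own): among moves dv with ||dv|| = eps,
# the move dv = -(eps / ||grad_C||) * grad_C minimizes grad_C . dv,
# i.e. it is the direction of steepest descent.
def steepest_step(grad_C, eps):
    return -(eps / np.linalg.norm(grad_C)) * grad_C

rng = np.random.default_rng(0)
g = rng.normal(size=5)          # a stand-in gradient vector
eps = 0.01
dv_star = steepest_step(g, eps)

# Compare against 1000 random moves rescaled to the same length eps:
# none decreases grad_C . dv more than the steepest-descent step.
best_random = min(
    g @ (eps * u / np.linalg.norm(u)) for u in rng.normal(size=(1000, 5))
)
```

Here g @ dv_star equals -eps * ||g||, the most negative value attainable subject to the length constraint.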

───

 

Perhaps the reason for taking such pains is the hope that readers will gain an intuitive grasp of

Gradient descent

Gradient descent is an optimization algorithm, commonly also known as the method of steepest descent.

Description

Gradient descent is based on the following observation: if a real-valued function F(\mathbf{x}) is defined and differentiable at a point \mathbf{a}, then F(\mathbf{x}) decreases fastest from \mathbf{a} in the direction opposite to the gradient, -\nabla F(\mathbf{a}).

Hence, if

\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})

holds for a small enough \gamma > 0, then F(\mathbf{a}) \geq F(\mathbf{b}).

With this in mind, we can start from an initial estimate \mathbf{x}_0 of a local minimum of the function F and consider the sequence \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots such that

\mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n),\ n \ge 0

We then obtain

F(\mathbf{x}_0)\ge F(\mathbf{x}_1)\ge F(\mathbf{x}_2)\ge \cdots,

If all goes well, the sequence (\mathbf{x}_n) converges to the desired extremum. Note that the step size \gamma may change at each iteration.
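The iteration above can be sketched directly; in this minimal example (the function F and the step size are my own choices) we also check the claimed chain F(\mathbf{x}_0) \ge F(\mathbf{x}_1) \ge \cdots:

```python
# A minimal sketch of the iteration x_{n+1} = x_n - gamma_n * grad F(x_n),
# with F(x, y) = x^2 + y^2 (my own choice), whose gradient is (2x, 2y).
# F is recorded at every iterate to check F(x_0) >= F(x_1) >= F(x_2) >= ...
def F(x, y):
    return x * x + y * y

def grad_F(x, y):
    return 2 * x, 2 * y

gamma = 0.1                      # fixed step size (it could vary per step)
x, y = 3.0, -4.0                 # initial estimate x_0
values = [F(x, y)]
for _ in range(50):
    gx, gy = grad_F(x, y)
    x, y = x - gamma * gx, y - gamma * gy
    values.append(F(x, y))
```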

The figure below illustrates this process. Here F is assumed to be defined on the plane, and its graph is bowl-shaped. The blue curves are the contour lines (level sets), i.e. the curves on which F is constant. The red arrows point in the direction opposite to the gradient at each point. (The gradient at a point is perpendicular to the contour line through that point.) Following the direction of gradient descent, we eventually reach the bottom of the bowl, i.e. the point where the value of F is minimal.

[Figure: Gradient_descent, an illustration of gradient descent.]

Example

Gradient descent runs into trouble on some complicated nonlinear functions, for example the Rosenbrock function

f(x, y) = (1-x)^2 + 100(y-x^2)^2 .

Its minimum lies at (x, y) = (1, 1), where f(x, y) = 0. But the function has a narrow, curved valley, the minimum (x, y) = (1, 1) sits inside that valley, and the valley floor is very flat. The optimization therefore approaches the minimum in a zig-zag pattern, and progress is very slow.
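A minimal sketch of this behaviour; the step size and iteration count are my own choices, small enough that plain gradient descent stays stable inside the narrow valley:

```python
# Sketch of the difficulty described above: plain gradient descent on the
# Rosenbrock function creeps toward the minimum at (1, 1) very slowly.
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def grad_rosenbrock(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x * x)
    dy = 200 * (y - x * x)
    return dx, dy

gamma = 1e-3                     # small step size, to avoid divergence
x, y = -1.0, 1.0                 # starting point on the valley floor
for _ in range(10_000):
    gx, gy = grad_rosenbrock(x, y)
    x, y = x - gamma * gx, y - gamma * gy
```

Even after ten thousand steps the iterate is still measurably short of the minimum at (1, 1).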

[Figure: Banana-SteepDesc, steepest descent zig-zagging along the valley of the Rosenbrock function.]

───

 

However, a 'cross-disciplinary' analogy of this kind:

Okay, so calculus doesn’t work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn’t be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We’d randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C – those derivatives would tell us everything we need to know about the local “shape” of the valley, and therefore how our ball should roll.

Based on what I’ve just written, you might suppose that we’ll be trying to write down Newton’s equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we’re not going to take the ball-rolling analogy quite that seriously – we’re devising an algorithm to minimize C, not developing an accurate simulation of the laws of physics! The ball’s-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let’s simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

───

 

Can it promote 'understanding'? That truly depends on putting 'STEM' education into practice!

Science, Technology, Engineering and Mathematics (STEM, previously SMET) is an education grouping used worldwide. The acronym refers to the academic disciplines of science[note 1], technology, engineering and mathematics.[1] The term is typically used when addressing education policy and curriculum choices in schools to improve competitiveness in science and technology development. It has implications for workforce development, national security concerns and immigration policy.[1] The acronym arose in common use shortly after an interagency meeting on science education held at the US National Science Foundation chaired by the then NSF director Rita Colwell. A director from the Office of Science division of Workforce Development for Teachers and Scientists, Dr. Peter Faletra, suggested the change from the older acronym SMET to STEM. Dr. Colwell, expressing some dislike for the older acronym, responded by suggesting that NSF institute the change. One of the first NSF projects to use the acronym was STEMTEC, the Science, Technology, Engineering and Math Teacher Education Collaborative at the University of Massachusetts Amherst, which was funded in 1997.

───

 

For example, in the article 《踏雪尋梅!!》 we once discussed:

So how does science view 'prediction'? For instance, in 1744 the great Swiss mathematician and physicist Leonhard Euler, in his paper Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes, sive solutio problematis isoperimetrici latissimo sensu accepti ('A method for finding curves enjoying a maximum or minimum property, or the solution of the isoperimetric problem taken in its broadest sense'), gave a very clear and explicit definition of the 'principle of least action':

Let a particle of mass M move with velocity v through an infinitesimal distance ds. The particle's momentum is M \cdot v; multiplied by the infinitesimal distance ds, it gives M \cdot v \ ds, the particle's momentum acting over the infinitesimal 'path' distance ds. I claim: among all possible 'paths' connecting the 'start' and 'finish' endpoints, the particle's true 'trajectory' of motion is the 'path' for which \int_{initial}^{final}  M \cdot v \ ds is a minimum; if the mass is assumed to be constant, it is the 'orbit' for which \int_{initial}^{final}  v \ ds is a minimum.

That is, among all possible 'paths' (path) connecting the 'start' and 'finish' endpoints, the 'path' the particle chooses is an 'extremum' of the 'action' functional A = \int_{path}  M \cdot v \ ds; this is the 'calculus of variations' (Variation) description of Newton's second law of motion. Viewed from today's physical notion of energy, A = \int_{path}  M \cdot v \ ds = \int_{path}  M \cdot v \ \frac {ds}{dt} dt = \int_{path}  M \cdot v^2 dt = 2 \int_{path} T dt, where T = \frac{1}{2} M v^2 is the particle's kinetic energy. Since Newton's second law can be stated as F = M \cdot a = \frac {d P}{dt}, \ P = M \cdot v, we have \int_{path}  \frac {d P}{dt} ds = \int_{path}  \frac {d s}{dt} dP = \int_{path}  v dP  = \Delta T = \int_{path}  F ds.

If the force acting on the particle is a 'conservative force' (conservative force), that is, the 'work' (work) it does along any path depends only on the 'positions' of the 'start' and 'finish' endpoints and not on the 'path' travelled, then in physics one usually defines the 'potential energy' of this 'force field' as V = - \int_{ref}^{position}  F ds. Thus if a particle moves in a conservative field, \Delta T + \Delta V = 0, which is the principle of 'conservation of energy' in physics! For example gravity, spring forces, and electric forces are all conservative, while friction and air resistance are typical non-conservative forces. Since \Delta V is the same along all these possible paths, the 'path' determined by the 'principle of least action' is likewise an 'extremum' of the 'action' A. In 1788 the Italian-born French mathematician and astronomer Joseph Lagrange, who contributed greatly to the development of the calculus of variations, used the 'law of conservation of energy' in his treatise Mecanique Analytique ('Analytical Mechanics') to derive the correctness of the principle of least action as stated by Euler.
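Spelling out the last step, combining the work–energy relation \Delta T = \int_{path} F \, ds with the definition of potential energy just given:

```latex
\Delta T = \int_{path} F \, ds , \qquad
\Delta V = - \int_{path} F \, ds
\quad \Longrightarrow \quad
\Delta T + \Delta V = 0 ,
```

i.e. any kinetic energy gained is exactly the potential energy lost, which is the statement of energy conservation in a conservative field.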

Mathematically, a 'differential equation' of motion is equivalent to the corresponding 'integral equation'; that is nothing strange in itself, but once people began examining its 'philosophical meaning', it sparked many different views. Some say F = m a is like 'effect \propto cause', depicting an 'instantaneous link' of 'causality'. This is a form of 'determinism': from one 'point in spacetime' we advance, over an 'infinitesimal interval' dt, to the next adjoining 'point in spacetime', so once the 'initial state' is known, the 'final outcome' is already determined! Others say A = \int_{initial}^{final}  M \cdot v \ ds seems to fix a 'destination': from whatever 'starting point' one sets out, there will always be a 'path that reaches it'. This becomes a kind of 'teleology': nature itself finds the 'road' from 'here and now' to 'there and then'!! Are these various 'interpretations' of meaning truly so? It is as if one said, 'the flower blooms for reasons of its own; for whom, then, does it bloom so fair'??

───

 

So, would someone who has already understood

Potential energy

The conservative-force definition of potential energy

If the action and reaction forces acting on two particles do work that is independent of the specific path and depends only on the initial and final positions of the interacting particles, then such a pair of forces is called conservative. Forces that do not satisfy this condition are called non-conservative. Several equivalent conditions for a conservative field can be proved[1], giving the following properties of a conservative force:

  1. the work done by a conservative force between two given points is independent of the path;
  2. the work done by a conservative force around any closed loop is zero;
  3. a conservative force can be expressed as the (negative) gradient of a scalar function.

Generalizing to systems of many particles and to continuously distributed bodies: if the force between any two particles of a closed system is conservative, the system is called a conservative system. When the configuration of a conservative system changes, i.e. when the relative positions of its particles change, the interaction forces between them do work, and the total work done depends only on the particles' relative positions. The capacity of a conservative system to do work, associated in this way with relative position under conservative forces, is defined as a function called the potential-energy function of the system, or simply its potential energy[2]. Then the work the system does on its surroundings in passing from one configuration to another equals the potential energy of the former configuration minus that of the latter, which gives the potential-energy function a direct physical meaning.

Beyond this, the definition of potential energy can be extended further. For instance, in thermodynamics the interaction potential energy of gas molecules is the sum of the potential energies of a very large number of molecules; in practice it is described not by relative positions (configuration) but by thermodynamic parameters such as volume, temperature, and pressure. Again, under certain special constraints, forces that are ordinarily non-conservative can become conservative[3], or the net of several forces may happen to be a conservative force.
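The path independence of the work done by a conservative force can be checked numerically; the potential V(x, y) = x^2 + y^2 and the two polygonal paths below are my own choices:

```python
import numpy as np

# Numeric sketch (my own example) of path independence: for the conservative
# force F = -grad V with V(x, y) = x^2 + y^2, the work integral from
# p = (0, 0) to q = (1, 1) is the same along two different paths.
def force(p):
    x, y = p
    return np.array([-2.0 * x, -2.0 * y])       # F = -grad V

def work(points):
    """Midpoint-rule approximation of the line integral of F along a polyline."""
    return sum(force((a + b) / 2) @ (b - a)
               for a, b in zip(points[:-1], points[1:]))

t = np.linspace(0.0, 1.0, 2001)
straight = np.stack([t, t], axis=1)                        # straight line p -> q
bent = np.concatenate([np.stack([t, 0 * t], axis=1),       # along the x-axis,
                       np.stack([0 * t + 1, t], axis=1)])  # then straight up
W1, W2 = work(straight), work(bent)
```

Both approximations agree with V(p) - V(q) = -2, as the definition V = -∫ F ds requires.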

───

 

as well as the

Gradient theorem

The gradient theorem, also called the fundamental theorem of calculus for line integrals, states that the line integral of the gradient of a scalar field along a curve can be evaluated from the difference of the scalar field's values at the two endpoints of the curve.

Let \varphi : U \subseteq \mathbb{R}^n \to \mathbb{R} be a differentiable function, and let \gamma[\mathbf{p},\,\mathbf{q}] be a curve in U from \mathbf{p} to \mathbf{q}; then

 \varphi\left(\mathbf{q}\right)-\varphi\left(\mathbf{p}\right) = \int_{\gamma[\mathbf{p},\,\mathbf{q}]} \nabla\varphi(\mathbf{r})\cdot d\mathbf{r}.

The gradient theorem generalizes the fundamental theorem of calculus from a straight line (the number axis) to curves in the plane, in space, and indeed in general n-dimensional space.

The gradient theorem shows that line integrals of gradient fields are independent of the path, which is one of the ways a 'conservative force' is defined in physics. If \varphi is a potential, then \nabla\varphi is a conservative vector field. The formula above says that the work done by a conservative force depends only on the endpoints of the object's path, not on the path itself.

The gradient theorem has a converse: any path-independent vector field can be expressed as the gradient of some scalar field. Like the theorem itself, this converse has many consequences and applications in pure and applied mathematics.
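A numeric sketch of the gradient theorem itself; the scalar field \varphi(x, y) = x y^2 and the curved path are my own choices:

```python
import numpy as np

# Numeric sketch of the gradient theorem: the line integral of grad(phi)
# along a curved path from p to q equals phi(q) - phi(p). The field
# phi(x, y) = x * y**2 and the path r(t) = (t, sin(pi t / 2)) are examples.
def phi(x, y):
    return x * y ** 2

def grad_phi(x, y):
    return np.array([y ** 2, 2.0 * x * y])

t = np.linspace(0.0, 1.0, 20001)
r = np.stack([t, np.sin(np.pi * t / 2)], axis=1)   # curve from (0,0) to (1,1)

# Midpoint-rule approximation of the line integral along the curve.
integral = sum(grad_phi(*((a + b) / 2)) @ (b - a)
               for a, b in zip(r[:-1], r[1:]))
endpoint_diff = phi(1.0, 1.0) - phi(0.0, 0.0)
```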

───

 

such basic concepts not see that the 'negative gradient' in fact has its origin in nature itself!!??