W!o+'s 《小伶鼬工坊演義》: Neural Networks【Perceptron】VIII

Let us once again review the model of the 'perceptron':

Perceptrons

What is a neural network? To get started, I’ll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it’s more common to use other models of artificial neurons – in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We’ll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it’s worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, x_1, x_2, \cdots, and produces a single binary output:

In the example shown the perceptron has three inputs, x_1, x_2, x_3. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, w_1, w_2, \cdots, real numbers expressing the importance of the respective inputs to the output. The neuron’s output, 0 or 1, is determined by whether the weighted sum \sum_j w_j x_j is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

output = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}

That’s all there is to how a perceptron works!

……
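
To make the rule above concrete, here is a minimal sketch of a perceptron, assuming illustrative weights and a threshold of my own choosing (none of the numbers or names below come from the text):

```python
# A minimal perceptron sketch: binary inputs, real-valued weights, and a
# real-valued threshold, following the rule quoted above.

def perceptron_output(x, w, threshold):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Illustrative values: the weights express how much each binary input
# matters, and the threshold sets how easily the neuron "fires".
w = [6.0, 2.0, 2.0]
threshold = 5.0

print(perceptron_output([0, 1, 1], w, threshold))   # 2 + 2 = 4 <= 5 -> 0
print(perceptron_output([1, 0, 0], w, threshold))   # 6 > 5          -> 1
```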

 

If we regard \sum_j w_j x_j = \text{threshold} as a 'hyperplane', then (w_1, w_2, \cdots) is the 'normal vector' perpendicular to that hyperplane.

A perceptron's 'output' value of 1 or 0 is therefore determined by whether the 'input' point (x_1, x_2, x_3, \cdots) lies on the upper side of this hyperplane, or on the hyperplane itself or its lower side.

Because the perceptron's 'output' is a 'discrete' 1 or 0, it is hard to be sure whether an 'input' point (x_1 + \Delta x_1, x_2 + \Delta x_2, x_3 + \Delta x_3, \cdots) 'near' (x_1, x_2, x_3, \cdots) will yield the 'same' output. From the standpoint of 'learning', a 'tiny' change in the 'normal vector' or in the 'threshold' may well throw earlier 'learning results' into disarray!
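
As a small illustration of this fragility, here is a sketch, with numbers chosen by me purely for illustration, of an input point lying almost on the hyperplane \sum_j w_j x_j = \text{threshold}: a tiny change in one weight flips the output from 1 to 0.

```python
# Sketch: the perceptron output is discontinuous across the hyperplane
# sum_j w_j x_j = threshold, so a tiny weight change can flip it outright.

def perceptron_output(x, w, threshold):
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) > threshold else 0

x = [1.0, 1.0, 1.0]      # an input point very close to the hyperplane
w = [1.0, 1.0, 1.001]    # weighted sum = 3.001
threshold = 3.0

print(perceptron_output(x, w, threshold))   # 3.001 > 3.0  -> 1
w[2] -= 0.002                               # a "small" adjustment while learning
print(perceptron_output(x, w, threshold))   # 2.999 <= 3.0 -> 0
```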

This is why Michael Nielsen next turns to

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we’d like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we’d like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we’d like is for this small change in weight to cause only a small corresponding change in the output from the network. As we’ll see in a moment, this property will make learning possible. Schematically, here’s what we want (obviously this network is too simple to do handwriting recognition!):

 

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an “8” when it should be a “9”. We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a “9”. And then we’d repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

───
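
By contrast, a hypothetical sigmoid neuron, using the function S(t) = 1/(1 + e^{-t}) quoted in the next excerpt, responds to the very same small weight change with only a small change in output. This is my own sketch, not code from the book:

```python
import math

# Sketch: a sigmoid neuron's output sigma(w.x + b) changes only slightly
# when a weight changes slightly, unlike the perceptron's step output.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_output(x, w, b):
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

x = [1.0, 1.0, 1.0]
w = [1.0, 1.0, 1.001]
b = -3.0                            # the bias plays the role of -threshold

before = sigmoid_output(x, w, b)    # weighted input z = +0.001
w[2] -= 0.002                       # the same tiny adjustment as before
after = sigmoid_output(x, w, b)     # weighted input z = -0.001

print(before, after, abs(before - after))   # outputs differ by about 0.0005
```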

 

Why use the 'sigmoid neuron' at all? Because this celebrated 'sigmoid function'

Sigmoid function

A sigmoid function is a mathematical function having an “S” shape (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function shown in the first figure and defined by the formula

S(t) = \frac{1}{1 + e^{-t}}.

Other examples of similar shapes include the Gompertz curve (used in modeling systems that saturate at large values of t) and the ogee curve (used in the spillway of some dams). A wide variety of sigmoid functions have been used as the activation function of artificial neurons, including the logistic and hyperbolic tangent functions. Sigmoid curves are also common in statistics as cumulative distribution functions, such as the integrals of the logistic distribution, the normal distribution, and Student’s t probability density functions.

───

 

has long been familiar! Perhaps a passage of text can better show its countless ties to the 'perceptron'!?
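
As a quick numerical aside of my own, the sketch below evaluates the logistic function S(t) = \frac{1}{1 + e^{-t}} and checks the standard identity \tanh(t) = 2S(2t) - 1, which is one way to see why both the logistic and hyperbolic tangent functions count as S-shaped activation functions:

```python
import math

# Sketch: the logistic function S(t) and its relation to tanh,
# tanh(t) = 2*S(2t) - 1; both are S-shaped ("sigmoid") curves.

def S(t):
    return 1.0 / (1.0 + math.exp(-t))

for t in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"t={t:+.1f}  S(t)={S(t):.4f}  "
          f"2*S(2t)-1={2*S(2*t) - 1:+.4f}  tanh(t)={math.tanh(t):+.4f}")
```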

[Figure: Pierre François Verhulst]

[Figure: the logistic curve] P(t) = \frac{1}{1 + \mathrm e^{-t}}

[Figure: the logit function] \operatorname{logit}(p)=\log\left( \frac{p}{1-p} \right)

[Figure: linear regression] Y \approx F(X, \Box)

[Figure: logistic map plot] x_{n+1} = r x_n(1 - x_n)

[Figure: logistic map animation]

[Figure: phase plot of x_{n+1} - x_n versus x_n]

[Figure: logistic map bifurcation diagram]

[Figure: logistic map]

[Figure: logistic map scatterplots]

[Figure: cobweb plots showing fixed point, oscillation, and chaos]

[Figure: Ganzhi (干支) diagram]

[Figure: Newton iteration animation]

In 1838 the Belgian mathematician Pierre François Verhulst published a 'population growth' equation,

\frac{dN}{dt} = r N \left(1 - \frac {N}{K} \right)

where N(t) is the population at time t, r is the natural growth rate, and K is the carrying capacity of the environment. Solving it gives

N(t) = \frac{K}{1+ C K e^{-rt}}

where C = \frac{1}{N(0)} - \frac{1}{K} is fixed by the initial condition. Verhulst called this function the 'logistic function', and the differential equation is accordingly known as the 'logistic equation'. If we rewrite it with P = \frac{N}{K} as \frac{dP}{dt} = r P \left(1 - P \right), 'normalizing' it, and take CK = 1 and r = 1, the plotted solution (the logistic curve above) satisfies 0 < P < 1; in other words, the population can never grow beyond the carrying capacity!
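
A short numerical check of my own, with illustrative r, K, and N(0), comparing a simple Euler integration of the differential equation against the closed-form solution above:

```python
import math

# Sketch: Euler-integrate dN/dt = r*N*(1 - N/K) and compare with the
# closed form N(t) = K / (1 + C*K*exp(-r*t)), where C = 1/N(0) - 1/K.

r, K, N0 = 1.0, 100.0, 5.0            # illustrative rate, capacity, initial population
C = 1.0 / N0 - 1.0 / K

def exact(t):
    return K / (1.0 + C * K * math.exp(-r * t))

dt, N, t = 0.001, N0, 0.0
while t < 10.0:
    N += r * N * (1.0 - N / K) * dt   # one Euler step
    t += dt

print(N, exact(10.0))                 # both approach the carrying capacity K
```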

If we invert P(t), we obtain t = \ln{\frac{P}{1 - P}}; this inverse is called the 'logit' function, defined as

\operatorname{logit}(p)=\log\left( \frac{p}{1-p} \right) , \ 0 < p < 1

It is commonly used for 'binary choices', say the 'probability distribution' of 'To Be or Not To Be', and also in 'regression analysis' to see whether two 'variables' are statistically 'related' or 'unrelated'. If we look at it through 'infinitesimal' numbers, \log\left( \frac{\delta p}{1-\delta p} \right) \approx \log(\delta p) \approx - \infty while \log\left( \frac{1-\delta p}{\delta p}\right) \approx \log(\frac{1}{\delta p}) = \log(H) \approx \infty, which perhaps makes the 'polarity' easier to appreciate!!
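
To see the inverse relation and this 'polarity' numerically, here is a small sketch of my own devising:

```python
import math

# Sketch: logit(p) = log(p / (1 - p)) inverts the logistic P(t) = 1/(1+exp(-t));
# near p = 0 it plunges toward -infinity, near p = 1 it soars toward +infinity.

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

print(logit(logistic(2.5)))           # recovers 2.5 (up to rounding)
for p in (1e-6, 0.5, 1.0 - 1e-6):
    print(p, logit(p))                # about -13.8, 0.0, +13.8
```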

In 1976 the Australian scientist Robert McCredie May published a paper, 《Simple mathematical models with very complicated dynamics》, proposing the 'unimodal' logistic map recurrence x_{n+1} = r x_n(1 - x_n), \ 0\leq x_n <1. This recurrence looks like a 'difference version' of the 'logistic equation', yet it turns out to be the classic example of 'chaos'. If the 'recurrence' has a 'limit value' x_{\infty} = x_H, then x_H = r x_H(1-x_H), so r{x_H}^2 = (r - 1) x_H, and hence x_H \approx 0 or x_H \approx \frac{r - 1}{r}. For r < 1 the unimodal map converges, quickly or slowly, to 'zero'; for 1 < r < 2 it approaches \frac{r - 1}{r} rapidly; for 2 < r < 3 it oscillates up and down, approaching \frac{r - 1}{r} linearly; at r = 3 it still converges to \frac{r - 1}{r}, but very slowly and no longer linearly; for 3 < r < 1 + \sqrt{6} \approx 3.45, for almost every 'initial condition' the system settles into a 'persistent oscillation' between two values, which beyond that point becomes four values, then eight, sixteen, and so on; finally, at about r = 3.5699, this oscillation pattern gives way and the system enters the so-called 'chaotic state'!!
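
A sketch of my own, iterating x_{n+1} = r x_n (1 - x_n) for a few illustrative r values, makes these regimes visible: convergence to 0 or to \frac{r-1}{r}, period-2 and period-4 oscillation, and chaos.

```python
# Sketch: iterate the logistic map and print the last few iterates for
# several r values, to see fixed points, period-2/period-4 cycles, and chaos.

def tail_of_orbit(r, x0=0.2, burn_in=1000, keep=4):
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(keep):             # record a few settled iterates
        x = r * x * (1.0 - x)
        orbit.append(round(x, 4))
    return orbit

for r in (0.8, 1.5, 2.8, 3.2, 3.5, 3.9):
    print(f"r={r}: {tail_of_orbit(r)}")

# r=0.8 -> all ~0.0               (converges to zero)
# r=1.5 -> all ~0.3333            (fixed point (r-1)/r)
# r=2.8 -> all ~0.6429            (fixed point (r-1)/r)
# r=3.2 -> two alternating values (period-2 oscillation)
# r=3.5 -> four recurring values  (period-4 oscillation)
# r=3.9 -> no repeating pattern   (chaos)
```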

The 'continuous' logistic differential equation exhibits no 'chaos', yet its 'discrete' difference counterpart does, so is this 'quantum' 'universe' chaotic or not?? Recalling the 'recursive functions' of the 'λ calculus' and the mathematical definition of a 'fixed point', the unimodal map can be viewed as 'iterated evaluation' of the function f(x) = r \cdot x(1 - x): x_1 = f(x_0), x_2 = f(x_1), \cdots, x_{k+1} = f(x_k), \cdots. When f^{(p)}(x_f) = \underbrace{f(f(\cdots f}_{p \text{ times}}(x_f) \cdots)) = x_f, that x_f is a 'fixed point'; the figures above show, for different values of r, how the iteration passes from a 'fixed point', through 'oscillation', to 'chaos'. If we rewrite the 'logistic equation' as \Delta P(t) = P(t + \Delta t) - P(t) = \left( r P(t) \left[ 1 - P(t) \right] \right) \cdot \Delta t and take t = n \Delta t, \Delta t = 1, we get P(n + 1) - P(n) = r P(n) \left[ 1 - P(n) \right], whose 'limit values' P(H) \approx 0, 1 do not depend on r at all, which shows that the two have different 'roots'! Yet it does suggest a 'time series' point of view: regarding x_n as x(n \Delta t), \ \Delta t = 1, the quantity \frac{x[(n+1) \Delta t] - x[n \Delta t]}{\Delta t} = x_{n+1} - x_n can be read as a 'velocity', so that (x_n, x_{n+1} - x_n) forms an imagined 'phase space', translating a 'recurrence relation' into a kind of 'symbolic dynamics'!!
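
The sketch below, my own comparison with illustrative r values in (0, 2), contrasts the \Delta t = 1 difference scheme, whose non-trivial fixed point stays at 1 regardless of r, with the logistic map, whose non-zero fixed point \frac{r-1}{r} moves with r:

```python
# Sketch: the Delta t = 1 difference version of the logistic equation,
# P(n+1) = P(n) + r*P(n)*(1 - P(n)), has fixed points 0 and 1 independent
# of r, whereas the logistic map's non-zero fixed point is (r-1)/r.

def difference_equation(r, p0=0.2, steps=200):
    p = p0
    for _ in range(steps):
        p = p + r * p * (1.0 - p)
    return p

def logistic_map(r, x0=0.2, steps=200):
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

for r in (1.2, 1.5, 1.8):             # for 0 < r < 2 the difference scheme settles at 1
    print(f"r={r}: difference equation -> {difference_equation(r):.4f}, "
          f"logistic map -> {logistic_map(r):.4f}, (r-1)/r = {(r - 1) / r:.4f}")
```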

At certain particular values of r this 'recurrence' has an exact solution; for instance, at r = 2, x_n = \frac{1}{2} - \frac{1}{2}(1-2x_0)^{2^{n}}. Since for x_0 \in (0,1) we have (1-2x_0)\in (-1,1), it follows that n \approx \infty \Longrightarrow (1-2x_0)^{2^{n}} \approx 0, and therefore x_H \approx \frac{1}{2}. Moreover, because the 'exponent' 2^n is 'even', this 'symbolic dynamical system' approaches its 'limit value' at a non-uniform, that is nonlinear, rate and without oscillating.
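
Here is a check of my own that the closed form agrees with direct iteration at r = 2 and climbs toward \frac{1}{2} without oscillating:

```python
# Sketch: at r = 2 the logistic map has the exact solution
# x_n = 1/2 - (1/2)*(1 - 2*x0)**(2**n); compare it with direct iteration.

def iterate(x0, n, r=2.0):
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
    return x

def closed_form(x0, n):
    return 0.5 - 0.5 * (1.0 - 2.0 * x0) ** (2 ** n)

x0 = 0.1
for n in range(6):
    print(n, iterate(x0, n), closed_form(x0, n))

# Both columns agree and rise monotonically toward 1/2: for n >= 1 the even
# exponent 2**n makes (1 - 2*x0)**(2**n) positive and ever smaller.
```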

── Excerpted from 《【Sonic π】電路學之補充《四》無窮小算術‧中下上》