W!o+ 的《小伶鼬工坊演義》︰神經網絡【梯度消失問題】四

有人淘寶,有人墾荒。棋局方開,天下初定。何不大膽採用新法,小心驗證虛實乎??!!

『邏輯學』上說『有□則有○,無○則無□』,既已『有□』又想『無○』,哪裡能夠不矛盾的啊!過去魏晉時『王弼』講︰一,數之始而物之極也。謂之為妙有者,欲言有,不見其形,則非有,故謂之無;欲言其無,物由之以生,則非無,故謂之有也。斯乃無中之有,謂之妙有。假使用『恆等式』1 - x^n = (1 - x)(1 + x + \cdots + x^{n-1}) 來計算 \frac{1 + x + \cdots + x^{m-1}}{1 + x + \cdots + x^{n-1}},將等於 \frac{1 - x^m}{1 - x^n} = (1 - x^m) \left[1 + (x^n) + { (x^n) }^2 + { (x^n) }^3 + \cdots \right] = 1 - x^m + x^n - x^{n+m} + x^{2n} - \cdots,那麼 1 - 1 + 1 - 1 + \cdots 難道不應該『等於』\frac{m}{n} 的嗎?一七四三年時,『伯努利』正因此而反對『歐拉』所講的『可加性』說法,『同』一個級數怎麼可能有『不同』的『和』的呢??作者不知如果在太空裡,乘坐著『加速度』是 g 的太空船,在上面用著『樹莓派』控制的『奈米手』來擲『骰子』,是否一定能得到『相同點數』呢?難道說『牛頓力學』不是只要『初始態』是『相同』的話,那個『骰子』的『軌跡』必然就是『一樣』的嗎??據聞,法國義大利裔大數學家『約瑟夫‧拉格朗日』伯爵 Joseph Lagrange 倒是有個『說法』︰事實上,對於『不同』的 m, n 來講,從『冪級數』來看,那個級數 1 - x^m + x^n - x^{n+m} + x^{2n} - \cdots 是有『零的間隙』的 1 + 0 + 0 + \cdots - 1 + 0 + 0 + \cdots,這就與 1 - 1 + 1 - 1 + \cdots 在『形式』上『不同』,我們怎麼能『先驗』的『期望』結果會是『相同』的呢!!

假使我們將『幾何級數』1 + z + z^2 + \cdots + z^n + \cdots = \frac{1}{1 - z},擺放到『複數平面』之『單位圓』上來『研究』,輔之以『歐拉公式』z = e^{i \theta} = \cos \theta + i\sin \theta,或許可以略探『可加性』理論的『意指』。當 0 < \theta < 2 \pi 時,\cos \theta \neq 1,雖然 |e^{i \theta}| = 1,我們假設那個『幾何級數』會收斂,於是得到 1 + e^{i \theta} + e^{2i \theta} + \cdots = \frac{1}{1 - e^{i \theta}} = \frac{1}{2} + \frac{1}{2} i \cot \frac{\theta}{2},所以 \frac{1}{2} + \cos{\theta} + \cos{2\theta} + \cos{3\theta} + \cdots = 0 以及 \sin{\theta} + \sin{2\theta} + \sin{3\theta} + \cdots = \frac{1}{2} \cot \frac{\theta}{2}。如果我們用 \theta = \phi + \pi 來『代換』,此時 -\pi < \phi < \pi,可以得到【一】 \frac{1}{2} - \cos{\phi} + \cos{2\phi} - \cos{3\phi} + \cdots = 0 和【二】 \sin{\phi} - \sin{2\phi} + \sin{3\phi} - \cdots = \frac{1}{2} \tan \frac{\phi}{2}。要是在【一】式中將 \phi 設為『零』的話,我們依然會有 1 - 1 + 1 - 1 + \cdots = \frac{1}{2};要是驗之以【二】式,當 \phi = \frac{\pi}{2} 時,原式可以寫成 1 - 0 - 1 - 0 + 1 - 0 - 1 - 0 + \cdots = \frac{1}{2}。如此看來 s = 1 + z + z^2 + z^3 + \cdots = 1 + z s 的『形式運算』,可能是有更深層的『關聯性』的吧!!
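
此處不妨附上一段小小的 Python 試算(僅為示意):把 1 - 1 + 1 - 1 + \cdots 視為冪級數 \sum (-1)^n r^n 在 r \to 1^{-} 時之極限,即所謂『阿貝爾和』,其值確實趨近 \frac{1}{2} ︰

# 阿貝爾和的小試算:sum (-1)^n r^n = 1/(1+r),當 r -> 1- 時趨近 1/2
for r in (0.9, 0.99, 0.999, 0.9999):
    s = sum((-r) ** n for n in range(100000))   # 取前十萬項作有限近似
    print(r, s, 1.0 / (1.0 + r))                # 理論值 1/(1+r)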

【圖:單位圓上之三角函數、複數平面之單位圓(各角度標示)、週期正弦波】

假使我們將【二】式對 \phi 作『逐項微分』得到 \cos{\phi} - 2\cos{2\phi} + 3\cos{3\phi} - \cdots = \frac{1}{4} \frac{1}{{(\cos \frac{\phi}{2})}^2},此時令 \phi = 0,就得到 1 - 2 + 3 - 4 + 5 - \cdots = \frac{1}{4}。如果把【一】式改寫成 \cos{\phi} - \cos{2\phi} + \cos{3\phi} - \cdots = \frac{1}{2},然後對 \phi 作『逐項積分』\int \limits_{0}^{\theta},並將變數 \theta 改回 \phi 後得到 \sin{\phi} - \frac{\sin{2\phi}}{2} + \frac{\sin{3\phi}}{3} - \cdots = \frac{\phi}{2};再作一次『逐項積分』\int \limits_{0}^{\theta},且將變數 \theta 改回 \phi 後將得到 1 - \cos{\phi} - \frac{1 - \cos{2\phi}}{2^2} + \frac{1 - \cos{3\phi}}{3^2} - \cdots = \frac{\phi^2}{4},於是當 \phi = \pi 時,1 + \frac{1}{3^2} + \frac{1}{5^2} + \cdots = \frac{\pi^2}{8}。然而 1 + \frac{1}{3^2} + \frac{1}{5^2} + \cdots = \left[1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \frac{1}{5^2} + \cdots \right] - \left[\frac{1}{2^2} + \frac{1}{4^2} + \frac{1}{6^2} + \cdots \right] = \left[1 - \frac{1}{4}\right] \left[1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \frac{1}{5^2} + \cdots \right],如此我們就能得到『巴塞爾問題』的答案 \sum \limits_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}。那麼

S = \ \ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \cdots
4S = \ \ \ \ \ \ \ 4 \ \ \ \ \ \ \ \ \ \ + 8 \ \ \ \ \ \ \ \ \ \ \ + 12 + \cdots
兩式相減 S - 4S = -3S = 1 - 2 + 3 - 4 + 5 - 6 + \cdots = \frac{1}{4},所以 S = - \frac{1}{12}。
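
凡此種種,亦可用 Python 稍作數值佐證(僅為示意,並假設已安裝 sympy):巴塞爾問題可直接取有限項逼近,而 1 + 2 + 3 + \cdots 之『- \frac{1}{12}』則對應黎曼 zeta 函數解析延拓之值 \zeta(-1)︰

import math
import sympy

# 巴塞爾問題:sum 1/n^2 趨近 pi^2 / 6
partial = sum(1.0 / n ** 2 for n in range(1, 200001))
print(partial, math.pi ** 2 / 6)

# 奇數項部份和:1 + 1/3^2 + 1/5^2 + ... 趨近 pi^2 / 8
odd = sum(1.0 / n ** 2 for n in range(1, 200001, 2))
print(odd, math.pi ** 2 / 8)

# 1 + 2 + 3 + ... 之『可加法』值,對應 zeta 函數之解析延拓
print(sympy.zeta(-1))            # -1/12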

但是這樣的作法果真是有『道理』的嗎?假使按照『級數的極限』之『定義』,如果『部份和』S_n = \sum \limits_{k=0}^{n} a_k 之『極限』S = \lim \limits_{n \to \infty} S_n 存在,S 能不滿足 S = a_0 + a_1 + a_2 + a_3 + \cdots = a_0 + (S - a_0) 的嗎?或者可以是 \sum \limits_{n=0}^{\infty} k \cdot a_n \neq k \cdot S 的呢?即使又已知 S^{\prime} = \sum \limits_{n=0}^{\infty} b_n,還是說可能會發生 \sum \limits_{n=0}^{\infty} (a_n + b_n) \neq S + S^{\prime} 的哩!若是說那些都不會發生,所謂的『可加性』的『概念』應當就可以看成『擴大』且包含『舊有』的『級數的極限』的『觀點』的吧!也許我們應當使用別種『記號法』來『表達』它,以免像直接寫作 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + \cdots = - \frac{1}{12} 般的容易引起『誤解』,畢竟是也存在著多種『可加法』的啊!至於說那些『可加法』的『意義詮釋』,就看『使用者』的吧!!

在此僅略為補充,『複數函數』f(z) = \frac{1}{1 - z} 除了在 z = 1 處『不連續』外,處處連續;而『幾何級數』1 + z + z^2 + \cdots + z^n + \cdots = \frac{1}{1 - z} 在 |z| < 1 時『都收斂』,因是對單位圓上 z_1 \neq 1 之點而言,\lim \limits_{z \to z_1, \ |z| < 1} \left( 1 + z + z^2 + \cdots + z^n + \cdots \right) = f(z_1)。也就是說『連續性』、『泰勒展開式』與『級數求和』等等之間有極深的『聯繫』,事實上它也與『定點理論』f(x) = x 之『關係』微妙得很啊!!

【圖:卡西米爾效應之平行板示意圖,及其水波類比影片】

一九四八年時,荷蘭物理學家『亨德里克‧卡西米爾』 Hendrik Casimir 提出了『真空不空』的『議論』。因為依據『量子場論』,『真空』也得有『最低能階』,因此『真空能量』不論因不因其『實虛』粒子之『生滅』,總得有一個『量子態』。由於已知『原子』與『分子』的『主要結合力』是『電磁力』,那麼該『如何』說『真空』之『量化』與『物質』的『實際』是怎麼來『配合』的呢?因此他『計算』了這個『可能效應』之『大小』,然而無論是哪種『震盪』所引起的,他總是得要面臨『無窮共振態』\langle E \rangle = \frac{1}{2} \sum \limits_{n}^{\infty} E_n 的『問題』,這也就是說『平均』有『多少』各種能量的『光子?』所參與 h\nu + 2h\nu + 3h\nu + \cdots 的『問題』?據知『卡西米爾』用『歐拉』等之『可加法』,得到了 \frac{F_c}{A} = -\frac{\hbar c \pi^2}{240 a^4}

此處之『-』號代表『吸引力』,而今早已經『證實』的了,真不知『宇宙』是果真先就有『計畫』的嗎?還是說『人們』自己還在『幻想』的呢??
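
順便以 Python 粗估此『吸引力』之數量級(以下板距 a = 1 微米純屬示例假設)︰

import math

hbar = 1.054571817e-34      # 約化普朗克常數 (J·s)
c = 2.99792458e8            # 光速 (m/s)
a = 1e-6                    # 假設板距為 1 微米

# 卡西米爾壓強 F_c / A = - hbar * c * pi^2 / (240 * a^4)
pressure = -(hbar * c * math.pi ** 2) / (240 * a ** 4)
print(pressure, "Pa")       # 約 -1.3e-3 Pa,負號表『吸引』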

─── 摘自《【Sonic π】電聲學之電路學《四》之《 V!》‧下

 

或可得 Michael Nielsen 先生預示之神經網絡光明前景耶!!??

Other obstacles to deep learning

In this chapter we’ve focused on vanishing gradients – and, more generally, unstable gradients – as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won’t comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

As a first example, in 2010 Glorot and Bengio*

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton*

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).

studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

These examples suggest that “What makes deep networks hard to train?” is a complex question. In this chapter, we’ve focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we’ll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.


W!o+ 的《小伶鼬工坊演義》︰神經網絡【梯度消失問題】三

是否可能『無限邊界』只包圍『有限面積』??難道真的『重物』跑得比『輕物』較快!!人因懷疑而論辯,因無法實驗而思想︰

【圖:分形維度示例與科赫雪花】

人們『直覺』上都知道,『線』是一維的、『面』是二維的以及『體』是三維的,也許同意『點』是零維的。但是要怎麼定義『維度』呢?假使設想用一根有刻度的 R 尺,量一條線段,得到 l \eqcirc ── 單位刻度 ──,如果有另一根 R^' 的尺,它的單位刻度 \eqcirc^' 是 R 尺的 \lambda 倍,也就是說 \eqcirc^{'} = \lambda \cdot \eqcirc,那用這 R^' 之尺來量該線段將會得到 l \cdot \lambda^{-1} \eqcirc^'。同樣的如果 R 尺量一個正方形得到數值 l^2,那用 R^' 之尺來量就會得到數值 l^2 \cdot \lambda^{-2},這樣那 R^{'}, R 兩尺的度量之數值比 N = \lambda^{-2}。
於是德國數學家費利克斯‧豪斯多夫 Felix Hausdorff 是這樣子定義『維度D 的︰

-D =\lim\limits_{\lambda \to 0}  \frac{log N} {log \lambda}

,使它也能適用於『分形』的『分維』。

在科赫雪花的建構過程中,從最初 n=0 的 △,到第 n 步時︰

總邊數︰ N_n = 3 \cdot 4^n
單一邊長︰ L_n = (\frac {1}{3})^n
總周長︰l_n = 3 \cdot (\frac {4}{3})^n
圍繞面積︰A_n = \frac {1}{5} (8 - 3 (\frac {4}{9})^n) \triangle

因此科赫雪花的分維是

-\lim\limits_{n \to \infty}  \frac{log N_n} {log L_n} = \frac{log 4}{log 3} = 1.261859507\cdots

,它所圍繞的極限面積 A_{\infty} = \frac{8}{5} \triangle,那時的總周長 l_{\infty} = \infty
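
上列諸式不難用 Python 逐步檢驗(僅為示意,面積以原三角形 \triangle 為單位)︰

import math

for n in range(8):
    N = 3 * 4 ** n                      # 總邊數
    L = (1 / 3) ** n                    # 單一邊長
    perimeter = N * L                   # 總周長,等於 3 * (4/3)^n,隨 n 發散
    area = (8 - 3 * (4 / 9) ** n) / 5   # 圍繞面積,趨近 8/5
    print(n, N, perimeter, area)

print(math.log(4) / math.log(3))        # 分維 log 4 / log 3 ~ 1.2618...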

【圖:伽利略畫像、一六三八年《Discorsi》扉頁、斜面滾球實驗、Galileo's own notes on the rolling ball experiment、How Galileo's spy glass upended science、伽利略衛星、Graphic description of Venus phases、伽利略與比薩斜塔】

人類天性中的直覺是認識事物、判斷是非和分辨真假非常重要的能力,但是它通常是『素樸的』,所以『概念的分析』也往往是一件『必要的』事。雖然面對『錯綜複雜』的現象,未經粹煉的『思想』容易誤入歧途,但是『直覺理解』的事物,常常流於『想當然耳』之過失。伽利略老年著作了一本『Discorsi e dimostrazioni matematiche』Mathematical Discourses and Demonstrations,據說是『思想實驗』的起始處︰
……
Salviati︰ If then we take two bodies whose natural speeds are different, it is clear that on uniting the two, the more rapid one will be partly retarded by the slower, and the slower will be somewhat hastened by the swifter. Do you not agree with me in this opinion?

Simplicio︰You are unquestionably right.

Salviati︰But if this is true, and if a large stone moves with a speed of, say, eight while a smaller moves with a speed of four, then when they are united, the system will move with a speed less than eight; but the two stones when tied together make a stone larger than that which before moved with a speed of eight. Hence the heavier body moves with less speed than the lighter; an effect which is contrary to your supposition. Thus you see how, from your assumption that the heavier body moves more rapidly than the lighter one, I infer that the heavier body moves more slowly.
……
這段對話用著想像分析、邏輯推導、討論演示著亞里斯多德的『重的東西動得快』的錯誤。概略的說,如果□與○二物重量不同,往同一個方向運動,□也跑得比較快,設想用一條繩索將這兩者連在一起,那快的應該會被拖慢,而慢的將會被拉快的吧。假使這是件事實,這樣『□+○』整體來看不是更重的東西嗎?難道它不與『重的東西動得快』矛盾的嗎??

古希臘哲學家埃利亞的芝諾 Ζήνων 宣稱︰即使是參與了特洛伊戰爭的古希臘神話英雄阿基里斯 Achilles ── 希臘第一勇士 ── 也追不上領先一步之烏龜,因為

動得再慢的物體也不可能被動得更快的物體追上。因為追逐者總得先抵達被逐者上個出發點,然而被逐者此刻早已經前往下個所在點。於是被逐者永遠在追逐者之前一步。

『運動』的概念即使能直覺掌握,卻從來都不是簡單的,就像所有瞬刻都可以說它是『靜止』的『箭矢』,又為什麼不能歸結出『動矢不動』的結論的呢??

─── 摘自《思想實驗!!

 

如是者自然喜讀 Michael Nielsen 先生一氣呵成之大哉論也︰

What’s causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let’s consider the simplest deep neural network: one with just a single neuron in each layer. Here’s a network with three hidden layers:

Here, w_1, w_2, \ldots are the weights, b_1, b_2, \ldots are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is \sigma(z_j), where \sigma is the usual sigmoid activation function, and z_j = w_{j} a_{j-1}+b_j is the weighted input to the neuron. I’ve drawn the cost C at the end to emphasize that the cost is a function of the network’s output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it’s far away, the cost will be high.

We’re going to study the gradient \partial C / \partial b_1 associated to the first hidden neuron. We’ll figure out an expression for \partial C / \partial b_1, and by studying that expression we’ll understand why the vanishing gradient problem occurs.

I’ll start by simply showing you the expression for \partial C / \partial b_1. It looks forbidding, but it’s actually got a simple structure, which I’ll describe in a moment. Here’s the expression (ignore the network, for now, and note that \sigma' is just the derivative of the \sigma function):

The structure in the expression is as follows: there is a \sigma'(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final \partial C / \partial a_4 term, corresponding to the cost function at the end. Notice that I’ve placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You’re welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There’s no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there’s also a simple explanation of why the expression is true, and so it’s fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change \Delta b_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change \Delta a_1 in the output from the first hidden neuron. That, in turn, will cause a change \Delta z_2 in the weighted input to the second hidden neuron. Then a change \Delta a_2 in the output from the second hidden neuron. And so on, all the way through to a change \Delta C in the cost at the output. We have

\frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}. \ \ \ \ (114)

This suggests that we can figure out an expression for the gradient \partial C / \partial b_1 by carefully tracking the effect of each step in this cascade.

To do this, let’s think about how \Delta b_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1), so

\Delta a_1  \approx \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \ \ \ \ (115)
= \sigma'(z_1) \Delta b_1. \ \ \ \ (116)

That \sigma'(z_1) term should look familiar: it’s the first term in our claimed expression for the gradient \partial C / \partial b_1. Intuitively, this term converts a change \Delta b_1 in the bias into a change \Delta a_1 in the output activation. That change \Delta a_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

\Delta z_2  \approx \frac{\partial z_2}{\partial a_1} \Delta a_1 \ \ \ \ (117)
= w_2 \Delta a_1. \ \ \ \ (118)

Combining our expressions for \Delta z_2 and \Delta a_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

\Delta z_2 \approx \sigma'(z_1) w_2 \Delta b_1. \ \ \ \ (119)

Again, that should look familiar: we’ve now got the first two terms in our claimed expression for the gradient \partial C / \partial b_1.

We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a \sigma'(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change \Delta C in cost to the initial change \Delta b_1 in the bias:

\Delta C \approx \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4)  \frac{\partial C}{\partial a_4} \Delta b_1. \ \ \ \ (120)

Dividing by \Delta b_1 we do indeed get the desired expression for the gradient:

\frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4}. \ \ \ \ (121)
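
As a quick sanity check (my own aside, not part of the text), here is a small NumPy sketch that compares this expression against a finite-difference estimate of \partial C / \partial b_1 on a randomly initialized four-neuron chain, assuming a quadratic cost C = \frac{1}{2}(a_4 - y)^2:

import numpy as np

# Sanity check of expression (121) on a chain of four sigmoid neurons,
# assuming a quadratic cost C = 0.5 * (a_4 - y)^2.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
w = rng.normal(size=5)            # w[1], ..., w[4] are used; w[0] is ignored
b = rng.normal(size=5)            # b[1], ..., b[4] are used; b[0] is ignored
a0, y = 0.3, 0.8                  # arbitrary input activation and target

def forward(b1):
    a, zs = a0, [None]
    for j in range(1, 5):
        z = w[j] * a + (b1 if j == 1 else b[j])
        zs.append(z)
        a = sigmoid(z)
    return a, zs

a4, zs = forward(b[1])
analytic = (sigmoid_prime(zs[1]) * w[2] * sigmoid_prime(zs[2])
            * w[3] * sigmoid_prime(zs[3]) * w[4] * sigmoid_prime(zs[4])
            * (a4 - y))           # expression (121), with dC/da_4 = a_4 - y

eps = 1e-6
cost = lambda a_out: 0.5 * (a_out - y) ** 2
numeric = (cost(forward(b[1] + eps)[0]) - cost(forward(b[1] - eps)[0])) / (2 * eps)
print(analytic, numeric)          # the two estimates should agree closely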

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let’s explicitly write out the entire expression for the gradient:

\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}. \ \ \ \ (122)

Excepting the very last term, this expression is a product of terms of the form w_j \sigma'(z_j). To understand how each of those terms behave, let’s look at a plot of the function \sigma':

[Plot of \sigma', the derivative of the sigmoid function]

The derivative reaches a maximum at \sigma'(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we’ll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j \sigma'(z_j) will usually satisfy |w_j \sigma'(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
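
A tiny numerical aside (my own, with illustrative numbers): drawing the weights from a standard Gaussian and bounding \sigma'(z_j) by its maximum 1/4, the product of such terms shrinks quickly as the number of layers grows:

import numpy as np

# Products of |w_j| * 1/4 with w_j ~ N(0, 1): they decay roughly exponentially
# in the number of factors, which is the heart of the vanishing-gradient story.
rng = np.random.default_rng(2)
for depth in (2, 5, 10, 20):
    w = rng.normal(size=depth)
    print(depth, np.prod(np.abs(w) * 0.25))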

To make this all a bit more explicit, let’s compare the expression for \partial C / \partial b_1 to an expression for the gradient with respect to a later bias, say \partial C / \partial b_3. Of course, we haven’t explicitly worked out an expression for \partial C / \partial b_3, but it follows the same pattern described above for \partial C / \partial b_1. Here’s the comparison of the two expressions:

The two expressions share many terms. But the gradient \partial C / \partial b_1 includes two extra terms each of the form w_j \sigma'(z_j). As we’ve seen, such terms are typically less than 1/4 in magnitude. And so the gradient \partial C / \partial b_1 will usually be a factor of 16 (or more) smaller than \partial C / \partial b_3. This is the essential origin of the vanishing gradient problem.

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it’s possible the terms w_j \sigma'(z_j) in the product will no longer satisfy |w_j \sigma'(z_j)| < 1/4. Indeed, if the terms get large enough – greater than 1 – then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we’ll have an exploding gradient problem.

The exploding gradient problem: Let’s look at an explicit example where exploding gradients occur. The example is somewhat contrived: I’m going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren’t merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we’ll choose the biases so that the \sigma'(z_j) terms are not too small. That’s actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so \sigma'(z_j) = 1/4). So, for instance, we want z_1 = w_1 a_0 + b_1 = 0. We can achieve this by setting b_1 = -100 * a_0. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j \sigma'(z_j) are equal to 100 * \frac{1}{4} = 25. With these choices we get an exploding gradient.
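
A minimal numerical sketch of that construction (my own illustration, taking an arbitrary input activation a_0 = 0.5):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

a = 0.5                               # arbitrary input activation a_0
terms = []
for _ in range(4):
    w = 100.0                         # large weight
    b = -w * a                        # chosen so that z = w * a + b = 0
    z = w * a + b
    terms.append(w * sigmoid_prime(z))    # each term is 100 * 1/4 = 25
    a = sigmoid(z)                        # activation passed to the next neuron
print(terms, np.prod(terms))          # the product grows like 25^4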

The unstable gradient problem: The fundamental problem here isn’t so much the vanishing gradient problem or the exploding gradient problem. It’s that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that’s an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it’s highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

The prevalence of the vanishing gradient problem: We’ve seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w \sigma'(z)|. To avoid the vanishing gradient problem we need |w \sigma'(z)| \geq 1. You might think this could happen easily if w is very large. However, it’s more difficult than it looks. The reason is that the \sigma'(z) term also depends on w: \sigma'(z) = \sigma'(wa +b), where a is the input activation. So when we make w large, we need to be careful that we’re not simultaneously making \sigma'(wa+b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa+b very large. Looking at the graph of \sigma' you can see that this puts us off in the “wings” of the \sigma' function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.

Unstable gradients in more complex networks

We’ve been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

\delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^L) \nabla_a C \ \ \ \ (124)

Here, \Sigma'(z^l) is a diagonal matrix whose entries are the \sigma'(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And \nabla_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T \Sigma'(z^j). What’s more, the matrices \Sigma'(z^j) have small entries on the diagonal, none larger than \frac{1}{4}. Provided the weight matrices w^j aren’t too large, each additional term (w^j)^T \Sigma'(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn’t merely an accident or an inconvenience: it’s a fundamental consequence of the approach we’re taking to learning.
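
To see the same effect numerically, here is a small NumPy sketch (my own, not from the book) that evaluates the expression above for a randomly initialized [784, 30, 30, 30, 10] sigmoid network, assuming a quadratic cost, and prints the length of \delta^l for each layer:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(1)
sizes = [784, 30, 30, 30, 10]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

x = rng.random(784)                   # an arbitrary input vector
y = np.zeros(10); y[3] = 1.0          # an arbitrary one-hot target

# forward pass, keeping the weighted inputs z^l
activations, zs = [x], []
for w, b in zip(weights, biases):
    z = w @ activations[-1] + b
    zs.append(z)
    activations.append(sigmoid(z))

# backward pass: delta^L = Sigma'(z^L) grad_a C, then
# delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1}
delta = sigmoid_prime(zs[-1]) * (activations[-1] - y)
norms = [np.linalg.norm(delta)]
for l in range(len(weights) - 2, -1, -1):
    delta = sigmoid_prime(zs[l]) * (weights[l + 1].T @ delta)
    norms.append(np.linalg.norm(delta))
print(list(reversed(norms)))          # earliest layers usually have the smallest ||delta^l||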


W!o+ 的《小伶鼬工坊演義》︰神經網絡【梯度消失問題】二

洋洋灑灑之文,自然成章,根本無可註解。但請跟著 Michael Nielsen 先生來趟發現之旅耶!!

The vanishing gradient problem

So, what goes wrong when we try to train a deep network?

To answer that question, let’s first revisit the case of a network with just a single hidden layer. As per usual, we’ll use the MNIST digit classification problem as our playground for learning and experimentation*

*I introduced the MNIST problem and data here and here..

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you’ll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don’t use git then you can download the data and code here. You’ll need to change into the src subdirectory.

Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28 \times 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits (‘0’, ‘1’, ‘2’, \ldots, ‘9’).

Let’s try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate \eta = 0.1, and regularization parameter \lambda = 5.0. As we train we’ll monitor the classification accuracy on the validation_data*

*Note that the network is likely to take some minutes to train, depending on the speed of your machine. So if you’re running the code you may wish to continue reading and return later, not wait for the code to finish executing.:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

We get a classification accuracy of 96.48 percent (or thereabouts – it’ll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Now, let’s add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That’s encouraging: a little more depth is helping. Let’s add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn’t help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, 
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That’s probably not a statistically significant drop, but it’s not encouraging, either.

This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn’t get worse, since the extra layers can, in the worst case, simply do nothing*

*See this later problem to understand how to build a hidden layer that does nothing.

. But that’s not what’s going on.

So what is going on? Let’s assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn’t finding the right weights and biases. We’d like to figure out what’s going wrong in our learning algorithm, and how to do better.

To get some insight into what’s going wrong, let’s visualize how the network learns. Below, I’ve plotted part of a [784,30,30,10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron’s weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient \partial C / \partial b for each neuron, i.e., the rate of change of the cost with respect to the neuron’s bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don’t worry if you don’t recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron’s weights and bias are changing as the network learns.

To keep the diagram simple, I’ve shown just the top six neurons in the two hidden layers. I’ve omitted the input neurons, since they’ve got no weights or biases to learn. I’ve also omitted the output neurons, since we’re doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.:

[Diagram: bars showing the gradients \partial C / \partial b for the top six neurons in each of the two hidden layers]

The network was initialized randomly, and so it’s not surprising that there’s a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let’s denote the gradient as \delta^l_j = \partial C / \partial b^l_j, i.e., the gradient for the jth neuron in the lth layer*

*Back in Chapter 2 we referred to this as the error, but here we’ll adopt the informal term “gradient”. I say “informal” because of course this doesn’t explicitly include the partial derivatives of the cost with respect to the weights, \partial C / \partial w.

. We can think of the gradient \delta^1 as a vector whose entries determine how quickly the first hidden layer learns, and \delta^2 as a vector whose entries determine how quickly the second hidden layer learns. We’ll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length \| \delta^1 \| measures the speed at which the first hidden layer is learning, while the length \| \delta^2 \| measures the speed at which the second hidden layer is learning.

With these definitions, and in the same configuration as was plotted above, we find \| \delta^1 \| = 0.07\ldots and \| \delta^2 \| = 0.31\ldots. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.

What happens if we add more hidden layers? If we have three hidden layers, in a [784,30,30,30,10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.

We’ve been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let’s return to look at the network with just two hidden layers. The speed of learning changes as follows:

To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train – I’ve used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set. I’m not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I’ve chosen is an easy way of smoothing the results out, so we can see what’s going on.

In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here’s the results of a similar experiment, but this time with three hidden layers (a [784,30,30,30,10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let’s add a fourth hidden layer (a [784,30,30,30,30,10] network), and see what happens when we train:

Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we’ve seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*

*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter’s earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German)..

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we’ll learn shortly that it’s not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it’s not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It’s something we need to understand, and, if possible, take steps to address.

One response to vanishing (or unstable) gradients is to wonder if they’re really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn’t it be good news if the derivative f'(x) was small? Wouldn’t that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don’t need to do much adjustment of the weights and biases?

Of course, this isn’t the case. Recall that we randomly initialized the weight and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don’t have enough information. And so it can’t possibly be the case that not much learning needs to be done in the first layer. If we’re going to train deep networks, we need to figure out how to address the vanishing gradient problem.

───

 

此一『梯度消失問題』之提出

Vanishing gradient problem

In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network’s weights receives an update proportional to the gradient of the error function with respect to the current weight in each iteration of training. Traditional activation functions such as the hyperbolic tangent function have gradients in the range (−1, 1) or [0, 1), and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the “front” layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly.

With the advent of the back-propagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter’s diploma thesis of 1991[1][2] formally identified the reason for this failure in the “vanishing gradient problem”, which not only affects many-layered feedforward networks, but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network.

When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.

───

 

石破天驚,其人或將永垂神經網絡史乎??

【圖:GDvp1~GDvp4】

 

此一『扣鐘之問』,當真是大扣大鳴也!!??

有時候『問』一個『好問題』,勝過『讀』萬卷『答案書』。就讓我們從了解什麼是『好問題』開始吧︰

已故的英國哲學家菲利帕‧福特 Philippa Foot 於一九六七年發表了一篇名為《墮胎問題和教條雙重影響》的論文,當中提出了『電車問題』,用來『批判』時下倫理哲學中之主流思想,尤其是功利主義的觀點︰

大部分之道德判斷都是依據『為最多的人謀取最大之利益』的『原則』而決定的。

她的原文引用如下

Suppose that a judge or magistrate is faced with rioters demanding that a culprit be found guilty for a certain crime and threatening otherwise to take their own bloody revenge on a particular section of the community. The real culprit being unknown, the judge sees himself as able to prevent the bloodshed only by framing some innocent person and having him executed.

Beside this example is placed another in which a pilot whose aeroplane is about to crash is deciding whether to steer from a more to a less inhabited area.

To make the parallel as close as possible it may rather be supposed that he is the driver of a runaway tram which he can only steer from one narrow track on to another; five men are working on one track and one man on the other; anyone on the track he enters is bound to be killed. In the case of the riots the mob have five hostages, so that in both examples the exchange is supposed to be one man’s life for the lives of five.

設想一個法官推事正面對著暴徒的要求,有個罪犯必須為某犯行認罪服法,否則他們將自行報復血洗這個社區的特定區域。然而真正的犯行者未明,法官觀察到自己似能阻止這場血洗,只要將無辜者構陷處決就好。

 

 

除了這個例子之外,另一個例子就是︰一位即將墜機的飛行員要決定將飛機導向人口較多還是較少的地區。

 

為了盡可能平行立論,就這樣假想吧︰某人駕駛著一輛失控的電車,他僅能將它從這一窄軌導向另一窄軌;一邊有五人正在軌上工作,另一軌上有一人;不論他進入何軌,那軌道上的人全都必死無疑。好比暴動一例中暴民手上有五位人質一樣,當下的兩個例子中都是如此假設一命與五命之兌換。

這個倫理學之『難題』現今有許多不同的類似之『版本』,也許是因為它是這個領域中最為知名的『思想實驗』之一的罷!!

解決問題

美籍匈牙利裔的大數學家 George Pólya 寫過一本《怎樣解題》的書,書中強調解題的『第一步』是『了解問題』是什麼?問題的『限制條件』又是什麼?『概念定義』有沒有『歧義』?能不能『改寫重述』這個問題的呢?如果『解題者』能這樣作,那就是『思過半矣』。

─── 摘自《勇闖新世界︰ 《 pyDatalog 》 導引《十》豆鵝狐人之問題篇


W!o+ 的《小伶鼬工坊演義》︰神經網絡【梯度消失問題】一

自由時報上的一篇文章

微軟徵才出這樣的考題… 玩殘求職者


一名求職者被微軟主考官要求計算直角三角形的面積,結果求職者竟然答錯了。(圖擷取自 《每日郵報》)

───

 

指出面對問題時,思維慣性之盲點。若說一般三角形,可用『底』 、『底之高』、『高之分割點』來定義。那麼直角三角形必須滿足『畢氏定理』就限制了三者間的關係。略分說如下︰

假設 \overline{A B} = a\overline{B C} = b\overline{C A} = c ,這個『底之高』 h 與『底』相交於 D , 將『底』分割成 \overline{A D} = x\overline{C D} = y 兩部份。則有

x + y = c \ \ \ \  (1)

{(x+y)}^2 = c^2 = a^2 + b^2 \ \ \ \ (2)

a^2 = h^2 + x^2 \ \ \ \ (3)

b^2 = h^2 + y^2 \ \ \ \ (4)

借 (2)(3)(4) 三式可得

x \times y = h^2 \ \ \ \ (5)

由 (1)(5) 可知 x、y 構成『二次方程式』

(z - x)(z - y) = z^2 - (x+y) \cdot z + x \times y = z^2 - c \cdot z + h^2 = 0

的根。若有實數解,則必須『判別式』

c^2 - 4 h^2 \ge 0

,因此 h \le \frac{c}{2} 。故知此『題意』之直角三角形『不存在』。
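
也可以用 Python 代入數字體會一下(以下 c、h 之數值純屬假設,僅為示意)︰

import math

# 判別式 c^2 - 4*h^2 < 0 時,x、y 無實數解,題意中的直角三角形不存在
c, h = 10.0, 6.0
print(c ** 2 - 4 * h ** 2)            # -44 < 0,無解

# 反之,只要 h <= c/2,即可解出底邊之兩段 x、y
c, h = 10.0, 4.0
d = math.sqrt(c ** 2 - 4 * h ** 2)
x, y = (c + d) / 2, (c - d) / 2
print(x, y, x * y, h ** 2)            # 且 x*y = h^2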

然而『面試』焉有時間作此『解析計算』耶??那人要是知道『圓心角』是『圓周角』兩倍,一個圓有三百六十度,自然會知以『底』為『直徑』的直角三角形,最大之『高』不過『半徑』而已!!然而能夠這樣推理果真容易乎??!!

 

如是或知 Michael Nielsen 先生欲言凡事祇『想當然爾』之不足也。

Imagine you’re an engineer who has been asked to design a computer from scratch. One day you’re working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You’re dumbfounded, and tell your boss: “The customer is crazy!”

Your boss replies: “I think they’re crazy, too. But what the customer wants, they get.”

In fact, there’s a limited sense in which the customer isn’t crazy. Suppose you’re allowed to use a special logical gate which lets you AND together as many inputs as you want. And you’re also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that’s just two layers deep.

But just because something is possible doesn’t make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we’re designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it’ll probably contain more than three layers, as we break the sub-tasks down into smaller units than I’ve described. But you get the general idea.

So deep circuits make the process of design easier. But they’re not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous series of papers in the early 1980s*

*The history is somewhat complex, so I won’t give detailed references. See Johan Håstad’s 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references.

showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it’s easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
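
As a concrete (and admittedly trivial) illustration of the pairwise idea, here is a short Python sketch of my own that computes the parity of a bit string with a log-depth tree of two-input XORs:

from functools import reduce
import operator

def parity_deep(bits):
    # XOR pairs of bits, then pairs of pairs, and so on: a log-depth "circuit".
    layer = list(bits)
    while len(layer) > 1:
        if len(layer) % 2:                # pad odd-length layers with a 0 bit
            layer.append(0)
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 1, 1]
print(parity_deep(bits), reduce(operator.xor, bits))   # both print the parity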

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we’ve worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we’d expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we’re doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*

*For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009)..

How can we train such deep networks? In this chapter, we’ll try training deep networks using our workhorse learning algorithm – stochastic gradient descent by backpropagation. But we’ll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we’ll dig down and try to understand what’s making our deep networks hard to train. When we look closely, we’ll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn’t simply due to bad luck. Rather, we’ll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we’ll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we’ll find that there’s an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what’s required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we’ll use deep learning to attack image recognition problems.

───


W!o+ 的《小伶鼬工坊演義》︰神經網絡【普遍性】

Michael Nielsen 先生用『小品文』essay 形式、視覺『驗證』方法、動態『操作』來展現『神經網絡』可以計算『任何函數』的能力。既說是『品』品的了,還請讀者一小口一小口細嚼慢嚥也。

 

One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, f(x):

……

Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we’ll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I’ll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.

……

Conclusion

The explanation for universality we’ve discussed is certainly not a practical prescription for how to compute using neural networks! In this, it’s much like proofs of universality for NAND gates and the like. For this reason, I’ve focused mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.

Although the result isn’t directly useful in constructing networks, it’s important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always “yes”. So the right question to ask is not whether any particular function is computable, but rather what’s a good way to compute the function.

The universality construction we’ve developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we’ve discussed, it’s possible to get the same result with just a single hidden layer. Given this, you might wonder why we would ever be interested in deep networks, i.e., networks with many hidden layers. Can’t we simply replace those networks with shallow, single hidden layer networks?

Chapter acknowledgments: Thanks to Jen Dodd and Chris Olah for many discussions about universality in neural networks. My thanks, in particular, to Chris for suggesting the use of a lookup table to prove universality. The interactive visual form of the chapter is inspired by the work of people such as Mike Bostock, Amit Patel, Bret Victor, and Steven Wittens.

While in principle that’s possible, there are good practical reasons to use deep networks. As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes. In later chapters, we’ll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge. To sum up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.

───

 

所謂『universality』哲學上譯作『共相

共相英語:Universal),又譯為普遍性,是一哲學佛學常用詞彙,但兩者意思不盡相同。在形上學中,共相是指在個別物體中所擁有的共通特性。舉例來說,在房間中,存在兩張個別的綠色椅子 ,在這兩張椅子中,都存在共通的「椅子」特質,以及「綠色」的性質。各別的椅子,稱為殊相,而這兩個各別椅子中共通的特質 ,被稱為共相。

……

 

須慎思明辨古今東西有

共相之爭

中世紀哲學家曾因共相是否真實存在開始了共相之爭,而當中有三種不同的主流見解:

  1. 唯實論:觀念世界比感官世界更早出現,因此共相早於事物。
  2. 唯名論:和唯實論抱持的觀點相反,認為共相後於事物。
  3. 概念論:綜合了唯名論和唯實論兩個學說,提出共相存在於事物當中。

彼得·阿伯拉的哲學名聲與其指稱理論(doctrine of name)或唯名論(nominalism)有關。彼得·阿伯拉認為字詞只是名稱,即「能指 」(signifiers)。並非所有的字詞都能指稱實體。彼得·阿伯拉特別指出,這個問題可上溯至柏拉圖與亞里斯多德的討論,例如用來指稱類別、性質與理想類型的字詞。problem of universals在於,這些用來指稱類別、性質與理想類型的字詞,實際上指稱的是不是實在的實體。

有些邏輯學家--被稱為「唯實論者」(realist)--堅持類別、性質與理想類型這些特有的實體的確存在,其他邏輯學家--被稱為「概念論者」 (conceptualist)--則堅持,共相只存在於心靈中。彼得·阿伯拉採取激進的觀點,認為除了個別事物之外,任何事物均不存在。他反對共相存在,並且否認唯實論者的觀點,後者主張事物擁有本質,而本質可讓事物如其所是。彼得·阿伯拉則認為,世上並不存在本質,字詞以共相來誤導我們的思考,但是共相並非實在之物;當我們使用語言時,字詞只是我們假設的建構物。

彼得·阿伯拉運用字詞與實在之間的嚴格區分來詮釋《聖經》,他認為宗教權威之間的明顯衝突,可藉由觀察他們如何以不同方式使用同一字詞,輕易獲得解決。彼得·阿伯拉是第一位以現代意義使用「神學」一詞的哲學家,他用神學來指稱針對宗教神祕所進行的理性調查。彼得·阿伯拉重拾已經持續千年的論辯,主張應以理性來審視啟示,凡是未經理性辯護的信仰,都只能算是意見。由於彼得·阿伯拉相信理性能洞察宗教真理,因此他堅持古希臘人對於基督教教義有著值得讚賞的貢獻,即使他們僅僅約略瞥見了三位一體的本質 。

「共相之爭」爭論的重點,在於「知識」與「本體」間互動的課題 。原來,柏拉圖的「觀念」、「感官」 的二元世界,以「觀念界」的先天性,作為「感官」後天「知識」的基準,因此認為「知識」先於「感官」事物,這就形成傾向柏拉圖思想的「實在論」 (Realismus),甚至發展成「極端實在論」(Exaggerate Realism),此派主張「共相先於事物」(Universale ante rem);而亞里斯多德的學說,則以為「知識」當始自「感官」經驗,而「觀念」則為「知識」後天所獲得,偏向亞里斯多德的學者主張「共相後於事物」 (Universale post rem)。這種你來我往的辯論 ,到後來各說各話,一直到「大學」成立,學術漸上軌道後,才出現折衷派的「共相在事物中」(Universale in re);算是表面上平息了「共相之爭」。

「共相」問題,基本上是蘇格拉底的「概念」(eidos, concept),柏拉圖的「觀念」(idea),亞里斯多德的「實體」(ousia, substance),是在檢證吾人的「思想」如何把握「存在」的「知識論」根本課題。「先物共相」(共相先於事物)是柏拉圖的主張,「後物共相」(共相後於事物)是亞里斯多德跟隨乃師蘇格拉底的學說,而「在物共相」(共相在事物之中)則是綜合和折衷二者所形成。當然,這種折衷也就是中世紀「共相之爭」的「知識論」成果之一。

───

 

須詳參細究其與『圖靈完備性』之關聯︰

Turing completeness

In computability theory, a system of data-manipulation rules (such as a computer’s instruction set, a programming language, or a cellular automaton) is said to be Turing complete or computationally universal if it can be used to simulate any single-taped Turing machine. The concept is named after English mathematician Alan Turing. A classic example is lambda calculus.

A closely related concept is that of Turing equivalence – two computers P and Q are called equivalent if P can simulate Q and Q can simulate P. The Church–Turing thesis conjectures that any function whose values can be computed by an algorithm can be computed by a Turing machine, and therefore that if any real-world computer can be simulated by a Turing machine, it is Turing equivalent to a Turing machine. A Universal Turing machine can be used to simulate any Turing machine and by extension the computational aspects of any possible real-world computer.[NB 1]

To show that something is Turing complete, it is enough to show that it can be used to simulate some Turing complete system. For example, an imperative language is Turing complete if it has conditional branching (e.g., “if” and “goto” statements, or a “branch if zero” instruction. See OISC) and the ability to change an arbitrary amount of memory (e.g., the ability to maintain an arbitrary number of variables). Since this is almost always the case, most (if not all) imperative languages are Turing complete if the limitations of finite memory are ignored.

……

Non-mathematical usage

In colloquial usage, the terms “Turing complete” or “Turing equivalent” are used to mean that any real-world general-purpose computer or computer language can approximately simulate the computational aspects of any other real-world general-purpose computer or computer language.

Real computers constructed so far are essentially similar to a single-tape Turing machine; thus the associated mathematics can apply by abstracting their operation far enough. However, real computers have limited physical resources, so they are only linear bounded automaton complete. In contrast, a universal computer is defined as a device with a Turing complete instruction set, infinite memory, and infinite available time.

───

 

或可借著

紙張、鉛筆和橡皮擦

??

火山噴發

蘭花草

胡適名言︰

寧鳴而死,不默而生。
要怎麼收穫,先那麼栽。
為學有如金字塔,要能廣大,要能高。
保守永遠是多數,年輕人只管向前衝。

蘭花草
詞︰胡適 曲︰佚名

我從山中來 帶著蘭花草
種在小園中 希望花開早
一日看三回 看得花時過
蘭花卻依然 苞也無一個

轉眼秋天到 移蘭入暖房
朝朝頻顧惜 夜夜不相忘
期待春花開 能將夙愿償
滿庭花簇簇 添得許多香

當『博大精深』遇上『溫文婉約』是否『輾轉反側』??

 

來一趟知性之旅耶???

Thue 之改寫系統《一》

λ 運算︰淵源介紹