W!o+'s 《小伶鼬工坊演義》: Neural Networks【The Vanishing Gradient Problem】III

Is it possible for an 'infinite boundary' to enclose only a 'finite area'?? Could it really be that 'heavy' objects move faster than 'light' ones!! People debate because they doubt, and think because they cannot experiment:

Fractal dimension example

Koch snowflake

Intuitively, people know that a 'line' is one-dimensional, a 'surface' is two-dimensional, and a 'solid' is three-dimensional, and they would perhaps agree that a 'point' is zero-dimensional. But how should 'dimension' be defined? Suppose we measure a line segment with a graduated ruler R and obtain l \eqcirc ── where \eqcirc is the unit graduation ──. If another ruler R^' has a unit graduation \eqcirc^' that is \lambda times that of R, that is, \eqcirc^{'} = \lambda \cdot \eqcirc, then measuring the same segment with R^' yields l \cdot \lambda^{-1} \eqcirc^'. Likewise, if measuring a square with the ruler R gives the value l^2, then measuring it with R^' gives the value l^2 \cdot \lambda^{-2}, so the ratio of the values measured by the two rulers R^' and R is N = \lambda^{-2}; in general, for a D-dimensional object the ratio is N = \lambda^{-D}.
Accordingly, the German mathematician Felix Hausdorff defined the 'dimension' D as follows:

-D =\lim\limits_{\lambda \to 0}  \frac{log N} {log \lambda}

, so that it also applies to the 'fractal dimension' of 'fractals'.
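
As a quick check of the definition: measuring a cube with the finer ruler R^' gives the value l^3 \cdot \lambda^{-3}, so N = \lambda^{-3} and -D = \lim\limits_{\lambda \to 0} \frac{log \lambda^{-3}}{log \lambda} = -3, recovering D = 3; a line segment likewise gives N = \lambda^{-1} and hence D = 1.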

In the construction of the Koch snowflake, starting from the initial triangle △ at n = 0, at the nth step we have:

Total number of sides: N_n = 3 \cdot 4^n
Length of each side: L_n = (\frac {1}{3})^n
Total perimeter: l_n = 3 \cdot (\frac {4}{3})^n
Enclosed area: A_n = \frac {1}{5} (8 - 3 (\frac {4}{9})^n) \triangle

Hence the fractal dimension of the Koch snowflake is

-\lim\limits_{n \to \infty}  \frac{log N_n} {log L_n} = \frac{log 4}{log 3} = 1.261859507\cdots

, the limiting area it encloses is A_{\infty} = \frac{8}{5} \triangle, while the total perimeter then is l_{\infty} = \infty.
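
These formulas are easy to check numerically. Below is a minimal Python sketch (the helper koch_stats and the sampled values of n are made up for illustration) that tabulates the quantities above and shows the perimeter diverging, the enclosed area approaching \frac{8}{5} \triangle, and -\frac{log N_n}{log L_n} approaching \frac{log 4}{log 3}:

```python
import math

# Koch snowflake after n construction steps, with the area of the initial
# triangle taken as the unit (so triangle = 1).  Formulas as given above.
def koch_stats(n):
    sides     = 3 * 4**n                # N_n = 3 * 4^n
    side_len  = (1/3)**n                # L_n = (1/3)^n
    perimeter = 3 * (4/3)**n            # l_n = 3 * (4/3)^n
    area      = (8 - 3 * (4/9)**n) / 5  # A_n, in units of the initial triangle
    return sides, side_len, perimeter, area

for n in (1, 2, 5, 50, 500):
    N, L, l, A = koch_stats(n)
    dim = -math.log(N) / math.log(L)    # tends (slowly) to log 4 / log 3
    print(f"n={n:3d}  perimeter={l:12.4e}  area={A:.6f}  dimension~{dim:.6f}")

print("limiting area :", 8 / 5)
print("log 4 / log 3 :", math.log(4) / math.log(3))
```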

Galileo

Title page of the 1638 Discorsi

Galileo's inclined-plane experiment

Galileo's own notes on the rolling ball experiment.

How Galileo's spy glass upended science

The Galilean moons

Graphic description of Venus phases

Galileo and the Leaning Tower of Pisa

Intuition, part of human nature, is an essential faculty for knowing things, judging right from wrong, and distinguishing true from false, yet it is usually 'naive', so 'conceptual analysis' is often a 'necessary' task as well. Faced with 'intricate and tangled' phenomena, unrefined 'thought' easily goes astray, and what is 'understood intuitively' often slips into the fault of 'taking things for granted'. In his old age Galileo wrote the 『Discorsi e dimostrazioni matematiche』 (Mathematical Discourses and Demonstrations), said to be the starting point of the 'thought experiment':
……
Salviati: If then we take two bodies whose natural speeds are different, it is clear that on uniting the two, the more rapid one will be partly retarded by the slower, and the slower will be somewhat hastened by the swifter. Do you not agree with me in this opinion?

Simplicio: You are unquestionably right.

Salviati: But if this is true, and if a large stone moves with a speed of, say, eight while a smaller moves with a speed of four, then when they are united, the system will move with a speed less than eight; but the two stones when tied together make a stone larger than that which before moved with a speed of eight. Hence the heavier body moves with less speed than the lighter; an effect which is contrary to your supposition. Thus you see how, from your assumption that the heavier body moves more rapidly than the lighter one, I infer that the heavier body moves more slowly.
……
This dialogue uses imagined analysis, logical deduction, and discussion to demonstrate the error in Aristotle's claim that 'heavier things move faster'. Roughly: suppose two objects □ and ○ of different weights move in the same direction, with □ moving faster. Imagine tying the two together with a rope; the faster one should then be slowed down, and the slower one pulled along faster. But if that is so, is the combined '□+○' not an even heavier object? Does it not contradict 'heavier things move faster'??

The ancient Greek philosopher Zeno of Elea (Ζήνων) claimed that even Achilles ── the foremost warrior of the Greeks, who fought in the Trojan War of Greek mythology ── could not catch a tortoise given a one-step head start, because

However slowly an object moves, it can never be overtaken by a faster-moving one. For the pursuer must first reach the point from which the pursued set out, yet by then the pursued has already moved on to its next position. Thus the pursued remains forever one step ahead of the pursuer.

Even if the concept of 'motion' can be grasped intuitively, it has never been simple. Just as an 'arrow' can be said to be 'at rest' at every instant, why can one not conclude that 'the flying arrow does not move'??

─── excerpted from 《思想實驗!!》

 

Small wonder, then, that one delights in reading Mr. Michael Nielsen's sweeping discussion, delivered in a single breath:

What’s causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let’s consider the simplest deep neural network: one with just a single neuron in each layer. Here’s a network with three hidden layers:

Here, w_1, w_2, \ldots are the weights, b_1, b_2, \ldots are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is \sigma(z_j), where \sigma is the usual sigmoid activation function, and z_j = w_{j} a_{j-1}+b_j is the weighted input to the neuron. I’ve drawn the cost C at the end to emphasize that the cost is a function of the network’s output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it’s far away, the cost will be high.
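
To make the setup concrete, here is a minimal Python sketch of the single-neuron-per-layer chain (the weights w, biases b, input activation a_0 and target y are made-up example values), computing the activations a_1, \ldots, a_4 and a quadratic cost:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up example parameters for the four-weight chain.
w  = [0.8, -1.2, 0.5, 1.1]   # w_1 .. w_4
b  = [0.1, -0.3, 0.2, 0.4]   # b_1 .. b_4
a0 = 0.6                     # input activation
y  = 1.0                     # desired output

a, zs, activations = a0, [], []
for wj, bj in zip(w, b):
    z = wj * a + bj          # z_j = w_j * a_{j-1} + b_j
    a = sigmoid(z)           # a_j = sigma(z_j)
    zs.append(z)
    activations.append(a)

cost = 0.5 * (activations[-1] - y) ** 2   # quadratic cost C(a_4)
print("a_1..a_4 :", [round(v, 4) for v in activations])
print("cost C   :", round(cost, 6))
```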

We’re going to study the gradient \partial C / \partial b_1 associated to the first hidden neuron. We’ll figure out an expression for \partial C / \partial b_1, and by studying that expression we’ll understand why the vanishing gradient problem occurs.

I’ll start by simply showing you the expression for \partial C / \partial b_1. It looks forbidding, but it’s actually got a simple structure, which I’ll describe in a moment. Here’s the expression (ignore the network, for now, and note that \sigma' is just the derivative of the \sigma function):

The structure in the expression is as follows: there is a \sigma'(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final \partial C / \partial a_4 term, corresponding to the cost function at the end. Notice that I’ve placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You’re welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There’s no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there’s also a simple explanation of why the expression is true, and so it’s fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change \Delta b_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change \Delta a_1 in the output from the first hidden neuron. That, in turn, will cause a change \Delta z_2 in the weighted input to the second hidden neuron. Then a change \Delta a_2 in the output from the second hidden neuron. And so on, all the way through to a change \Delta C in the cost at the output. We have

\frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}. \ \ \ \ (114)

This suggests that we can figure out an expression for the gradient \partial C / \partial b_1 by carefully tracking the effect of each step in this cascade.

To do this, let’s think about how \Delta b_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = \sigma(z_1) = \sigma(w_1 a_0 + b_1), so

\Delta a_1  \approx \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \ \ \ \ (115)
= \sigma'(z_1) \Delta b_1. \ \ \ \ (116)

That \sigma'(z_1) term should look familiar: it’s the first term in our claimed expression for the gradient \partial C / \partial b_1. Intuitively, this term converts a change \Delta b_1 in the bias into a change \Delta a_1 in the output activation. That change \Delta a_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

\Delta z_2  \approx \frac{\partial z_2}{\partial a_1} \Delta a_1 \ \ \ \ (117)
= w_2 \Delta a_1. \ \ \ \ (118)

Combining our expressions for \Delta z_2 and \Delta a_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

\Delta z_2 \approx \sigma'(z_1) w_2 \Delta b_1. \ \ \ \ (119)

Again, that should look familiar: we’ve now got the first two terms in our claimed expression for the gradient \partial C / \partial b_1.

We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a \sigma'(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change \Delta C in cost to the initial change \Delta b_1 in the bias:

\Delta C \approx \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4} \Delta b_1. \ \ \ \ (120)

Dividing by \Delta b_1 we do indeed get the desired expression for the gradient:

\frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4) \frac{\partial C}{\partial a_4}. \ \ \ \ (121)
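
As a sanity check of expression (121), here is a small Python sketch (made-up chain parameters, and the quadratic cost so that \partial C / \partial a_4 = a_4 - y) comparing the chain-rule product with a central finite-difference estimate of \partial C / \partial b_1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up parameters for the single-neuron-per-layer chain, quadratic cost
# C = (a_4 - y)^2 / 2, so that dC/da_4 = a_4 - y.
w = [0.8, -1.2, 0.5, 1.1]
b = [0.1, -0.3, 0.2, 0.4]
a0, y = 0.6, 1.0

def forward(biases):
    a, zs = a0, []
    for wj, bj in zip(w, biases):
        z = wj * a + bj
        zs.append(z)
        a = sigmoid(z)
    return zs, a

def cost(biases):
    _, a4 = forward(biases)
    return 0.5 * (a4 - y) ** 2

# Chain-rule product from (121):
#   dC/db_1 = sigma'(z_1) w_2 sigma'(z_2) w_3 sigma'(z_3) w_4 sigma'(z_4) * dC/da_4
zs, a4 = forward(b)
analytic = sigmoid_prime(zs[0])
for wj, zj in zip(w[1:], zs[1:]):
    analytic *= wj * sigmoid_prime(zj)
analytic *= (a4 - y)

# Central finite difference in b_1 as an independent check.
eps = 1e-6
bp = [b[0] + eps] + b[1:]
bm = [b[0] - eps] + b[1:]
numeric = (cost(bp) - cost(bm)) / (2 * eps)

print("analytic dC/db_1 =", analytic)
print("numeric  dC/db_1 =", numeric)
```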

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let’s explicitly write out the entire expression for the gradient:

\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}. \ \ \ \ (122)

Excepting the very last term, this expression is a product of terms of the form w_j \sigma'(z_j). To understand how each of those terms behaves, let's look at a plot of the function \sigma':

Plot of \sigma', the derivative of the sigmoid function

The derivative reaches a maximum at \sigma'(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we’ll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j \sigma'(z_j) will usually satisfy |w_j \sigma'(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
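
A small sketch makes the shrinkage visible (weights and biases drawn from N(0,1), the input activation drawn uniformly from (0,1); all of these choices are made up for illustration): the magnitude of a product of k such terms falls off rapidly as k grows.

```python
import math, random

def sig_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

random.seed(1)

def chain_product(k):
    """Product of k terms w_j * sigma'(z_j) along a single-neuron chain."""
    a = random.random()                  # made-up input activation in (0, 1)
    prod = 1.0
    for _ in range(k):
        wj, bj = random.gauss(0, 1), random.gauss(0, 1)
        z = wj * a + bj
        prod *= wj * sig_prime(z)
        a = 1.0 / (1.0 + math.exp(-z))   # next layer's activation
    return abs(prod)

for k in (2, 5, 10, 20, 40):
    # Average the magnitude of the product over many random chains.
    avg = sum(chain_product(k) for _ in range(2000)) / 2000
    print(f"k = {k:2d}   average |prod of w_j * sigma'(z_j)| = {avg:.3e}")
```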

To make this all a bit more explicit, let’s compare the expression for \partial C / \partial b_1 to an expression for the gradient with respect to a later bias, say \partial C / \partial b_3. Of course, we haven’t explicitly worked out an expression for \partial C / \partial b_3, but it follows the same pattern described above for \partial C / \partial b_1. Here’s the comparison of the two expressions:

The two expressions share many terms. But the gradient \partial C / \partial b_1 includes two extra terms each of the form w_j \sigma'(z_j). As we’ve seen, such terms are typically less than 1/4 in magnitude. And so the gradient \partial C / \partial b_1 will usually be a factor of 16 (or more) smaller than \partial C / \partial b_3. This is the essential origin of the vanishing gradient problem.
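
The factor of 16 (or more) can also be seen numerically. The sketch below (standard N(0,1) initialization, a random input activation, and an arbitrary number of trials) takes the median, over many random chains, of |\partial C / \partial b_3| / |\partial C / \partial b_1|, with the common \partial C / \partial a_4 factor cancelled:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

random.seed(0)
ratios = []
for _ in range(10000):
    # Standard initialization: weights and biases drawn from N(0, 1).
    w = [random.gauss(0, 1) for _ in range(4)]
    b = [random.gauss(0, 1) for _ in range(4)]
    a, zs = random.random(), []
    for wj, bj in zip(w, b):
        z = wj * a + bj
        zs.append(z)
        a = sigmoid(z)
    # Up to the shared dC/da_4 factor:
    #   dC/db_3 ~ sig'(z_3) w_4 sig'(z_4)
    #   dC/db_1 ~ sig'(z_1) w_2 sig'(z_2) w_3 * (dC/db_3 part)
    g3 = sig_prime(zs[2]) * w[3] * sig_prime(zs[3])
    g1 = sig_prime(zs[0]) * w[1] * sig_prime(zs[1]) * w[2] * g3
    ratios.append(abs(g3 / g1))

ratios.sort()
print("median |dC/db_3| / |dC/db_1| :", ratios[len(ratios) // 2])
```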

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it’s possible the terms w_j \sigma'(z_j) in the product will no longer satisfy |w_j \sigma'(z_j)| < 1/4. Indeed, if the terms get large enough – greater than 1 – then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we’ll have an exploding gradient problem.

The exploding gradient problem: Let’s look at an explicit example where exploding gradients occur. The example is somewhat contrived: I’m going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren’t merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the \sigma'(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so \sigma'(z_j) = 1/4). So, for instance, we want z_1 = w_1 a_0 + b_1 = 0. We can achieve this by setting b_1 = -100 * a_0. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j \sigma'(z_j) are equal to 100 * \frac{1}{4} = 25. With these choices we get an exploding gradient.
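
A sketch of this contrived setup (with a made-up input activation a_0 = 0.5) confirms that every term w_j \sigma'(z_j) comes out to 25, so the product of weight and \sigma' factors appearing in expression (122) is huge:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Contrived exploding-gradient setup: all weights large, and each bias chosen
# so that the weighted input z_j is exactly 0 (hence sigma'(z_j) = 1/4).
a0 = 0.5                   # made-up input activation
w  = [100.0] * 4           # w_1 = w_2 = w_3 = w_4 = 100
a, b, zs = a0, [], []
for wj in w:
    bj = -wj * a           # b_j = -w_j * a_{j-1}  =>  z_j = 0
    z  = wj * a + bj
    b.append(bj)
    zs.append(z)
    a = sigmoid(z)         # sigma(0) = 0.5, so every activation is 0.5

terms = [wj * sig_prime(zj) for wj, zj in zip(w, zs)]
print("w_j * sigma'(z_j) terms :", terms)        # each equals 25.0

# The weight/derivative part of expression (122):
#   sigma'(z_1) w_2 sigma'(z_2) w_3 sigma'(z_3) w_4 sigma'(z_4)
grad_part = sig_prime(zs[0])
for wj, zj in zip(w[1:], zs[1:]):
    grad_part *= wj * sig_prime(zj)
print("product in (122), before dC/da_4 :", grad_part)
```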

The unstable gradient problem: The fundamental problem here isn’t so much the vanishing gradient problem or the exploding gradient problem. It’s that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that’s an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it’s highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

The prevalence of the vanishing gradient problem: We’ve seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w \sigma'(z)|. To avoid the vanishing gradient problem we need |w \sigma'(z)| \geq 1. You might think this could happen easily if w is very large. However, it’s more difficult than it looks. The reason is that the \sigma'(z) term also depends on w: \sigma'(z) = \sigma'(wa +b), where a is the input activation. So when we make w large, we need to be careful that we’re not simultaneously making \sigma'(wa+b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa+b very large. Looking at the graph of \sigma' you can see that this puts us off in the “wings” of the \sigma' function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.
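
That constraint can be illustrated with a small numerical sweep (a sketch: the bias is fixed at b = 0 and the input activations a are scanned on a grid, both arbitrary choices). For each of several values of w it measures the width of the range of a over which |w \sigma'(wa + b)| \geq 1; the range is empty for |w| < 4, never wider than roughly 0.45, and shrinks again once w becomes very large:

```python
import math

def sig_prime(z):
    # Numerically safe sigma'(z) = sigma(z) * (1 - sigma(z)).
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return s * (1.0 - s)

b = 0.0        # arbitrary fixed bias for the sweep
step = 1e-4    # grid spacing for the input activation a
for w in (2.0, 5.0, 10.0, 50.0, 100.0):
    # Scan a over [-2, 2] and collect the points where |w * sigma'(w*a + b)| >= 1,
    # i.e. where the term w * sigma'(z) does not shrink the gradient.
    good = [i * step for i in range(-20000, 20001)
            if abs(w * sig_prime(w * i * step + b)) >= 1.0]
    width = (max(good) - min(good)) if good else 0.0
    print(f"w = {w:6.1f}   width of the range of a with |w*sigma'(wa+b)| >= 1 : {width:.4f}")
```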

Unstable gradients in more complex networks

We’ve been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

\delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots \Sigma'(z^L) \nabla_a C \ \ \ \ (124)

Here, \Sigma'(z^l) is a diagonal matrix whose entries are the \sigma'(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And \nabla_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T \Sigma'(z^j). What's more, the matrices \Sigma'(z^j) have small entries on the diagonal, none larger than \frac{1}{4}. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T \Sigma'(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, it is typically found empirically in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
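
As an illustration of equation (124), here is a Python/NumPy sketch (made-up layer sizes, N(0,1) weights and biases, a random input and target, and the quadratic cost) that propagates \delta backward through a deep sigmoid network and prints \|\delta^l\| for each layer; the earlier layers typically come out considerably smaller:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes   = [30] * 10 + [10]               # made-up layer sizes for a deep net
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal(m) for m in sizes[1:]]

# Forward pass with a random input and target, quadratic cost C = ||a^L - y||^2 / 2.
x, y = rng.random(sizes[0]), rng.random(sizes[-1])
a, zs = x, []
for W, b in zip(weights, biases):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)

# Backward pass: delta^L = (a^L - y) * sigma'(z^L), then, as in (124),
# delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1}.
delta = (a - y) * sigmoid_prime(zs[-1])
norms = [np.linalg.norm(delta)]
for W, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
    delta = sigmoid_prime(z) * (W.T @ delta)
    norms.append(np.linalg.norm(delta))

for l, nrm in enumerate(reversed(norms), start=1):
    print(f"layer {l:2d}: ||delta^l|| = {nrm:.3e}")
```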