W!o+'s 《小伶鼬工坊演義》: Neural Networks [The Vanishing Gradient Problem], Part I

An article in the Liberty Times:

Microsoft recruiting asks a question like this… and leaves job applicants stumped


A job applicant was asked by a Microsoft interviewer to calculate the area of a right triangle, and surprisingly got it wrong. (Image captured from the Daily Mail)

───

 

The episode points to a blind spot of habitual thinking when one faces a problem. An ordinary triangle can be specified by a 'base', the 'height on that base', and the 'point at which the height divides the base'; a right triangle, however, must also satisfy the Pythagorean theorem, which constrains the relationship among those three quantities. Briefly:

Suppose \overline{AB} = a , \overline{BC} = b , \overline{CA} = c , with the right angle at B , so that the 'base' is the hypotenuse \overline{CA} . The 'height on the base' h meets the base at D , dividing it into the two parts \overline{AD} = x and \overline{CD} = y . Then

x + y = c \ \ \ \  (1)

{(x+y)}^2 = c^2 = a^2 + b^2 \ \ \ \ (2)

a^2 = h^2 + x^2 \ \ \ \ (3)

b^2 = h^2 + y^2 \ \ \ \ (4)

Combining (2), (3) and (4), and expanding {(x+y)}^2 = x^2 + 2 x y + y^2 , we obtain

x \times y = h^2 \ \ \ \ (5)

By (1) and (5), x and y are therefore the two roots of the quadratic equation

(z - x)(z - y) = z^2 - (x+y) \cdot z + x \times y = z^2 - c \cdot z + h^2 = 0

For these roots to be real, the 'discriminant' must satisfy

c^2 - 4 h^2 \ge 0

and therefore h \le \frac{c}{2} . Hence the right triangle described in the question simply does not exist.
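The same conclusion is easy to check numerically. Below is a minimal Python sketch (the function name and the sample values c = 10 and h = 6 are illustrative choices of mine, not figures taken from the article): it forms the quadratic z^2 - c \cdot z + h^2 = 0 and reports the two segments of the base only when the discriminant allows real roots.

```python
import math

def split_base(c, h):
    """Solve z^2 - c*z + h^2 = 0 for the two segments x, y of the base.
    Returns (x, y) when real solutions exist, otherwise None."""
    disc = c**2 - 4 * h**2              # the discriminant from the text
    if disc < 0:
        return None                     # h > c/2: no such right triangle
    root = math.sqrt(disc)
    return ((c + root) / 2, (c - root) / 2)

print(split_base(10, 4))   # h <= c/2: the base splits into x = 8, y = 2
print(split_base(10, 6))   # h >  c/2: None -- the triangle cannot exist
```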

But who has time in an interview to carry out this sort of analytic calculation? Had the candidate known that a central angle is twice the corresponding inscribed angle, and that a full circle is three hundred and sixty degrees, it would have been clear that in a right triangle whose 'base' is a diameter, the greatest possible 'height' is no more than the radius!! Yet is reasoning that way really so easy??!!
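To spell that circle argument out in coordinates (a sketch, using Thales' theorem that an angle inscribed in a semicircle is a right angle): place A = (0, 0) and C = (c, 0), so the right-angle vertex B lies on the circle whose diameter is \overline{CA}, and

B = \left( \frac{c}{2} + \frac{c}{2} \cos\theta , \ \frac{c}{2} \sin\theta \right) , \ \ \ \ h = \frac{c}{2} \left| \sin\theta \right| \le \frac{c}{2}

so the height can never exceed the radius \frac{c}{2} .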

 

From this one may perhaps see what Michael Nielsen wishes to say: taking everything as a matter of course is not enough.

Imagine you’re an engineer who has been asked to design a computer from scratch. One day you’re working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You’re dumbfounded, and tell your boss: “The customer is crazy!”

Your boss replies: “I think they’re crazy, too. But what the customer wants, they get.”

In fact, there’s a limited sense in which the customer isn’t crazy. Suppose you’re allowed to use a special logical gate which lets you AND together as many inputs as you want. And you’re also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that’s just two layers deep.
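As a concrete illustration of that two-layer claim, here is a hedged Python sketch (my own construction, not code from the book): it realizes an arbitrary truth table as one layer of many-input NAND gates, one per input pattern on which the function is 1, followed by a single many-input NAND which, by De Morgan's law, acts as the OR of those matches. As is standard for two-level logic, it assumes each input is also available in complemented form.

```python
from itertools import product

def NAND(*bits):
    """Many-input NAND: AND all the inputs together, then negate."""
    return 1 - int(all(bits))

def two_layer_circuit(truth_table):
    """Build a two-layer NAND-NAND circuit for a Boolean function given as
    {input tuple: 0/1}.  One first-layer gate per 'minterm' (an input
    pattern on which the function is 1)."""
    minterms = [pattern for pattern, out in truth_table.items() if out == 1]

    def circuit(*inputs):
        # Layer 1: a gate's output is 0 exactly when the inputs match its minterm.
        layer1 = [NAND(*(x if want else 1 - x
                         for x, want in zip(inputs, pattern)))
                  for pattern in minterms]
        # Layer 2: a single NAND outputs 1 iff at least one minterm matched.
        return NAND(*layer1)
    return circuit

# Example: 3-input parity, the function discussed a little further below.
parity = {bits: sum(bits) % 2 for bits in product([0, 1], repeat=3)}
xor3 = two_layer_circuit(parity)
assert all(xor3(*bits) == parity[bits] for bits in parity)
```

Note that for parity this construction needs a first-layer gate for every input pattern of odd weight, i.e. 2^{n-1} gates for n inputs: exactly the kind of exponential blow-up that the shallow-circuit results mentioned below make precise.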

But just because something is possible doesn’t make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we’re designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

That is, our final circuit contains at least three layers of circuit elements. In fact, it’ll probably contain more than three layers, as we break the sub-tasks down into smaller units than I’ve described. But you get the general idea.
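A minimal Python sketch of that layered decomposition (the function names are my own, and the "gates" here are just Python's bitwise operators): bit-level full adders at the bottom, a ripple-carry adder built from them, and a shift-and-add multiplier built from the adder.

```python
def full_adder(a, b, carry_in):
    """Bottom layer: add two bits plus a carry, using only gate-level ops."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def ripple_add(x_bits, y_bits):
    """Middle layer: add two equal-length little-endian bit lists
    by chaining full adders."""
    result, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result + [carry]

def multiply(x_bits, y_bits):
    """Top layer: shift-and-add multiplication built out of adders."""
    width = len(x_bits) + len(y_bits)          # enough bits for the product
    acc = [0] * width
    for shift, bit in enumerate(y_bits):
        if bit:
            partial = [0] * shift + x_bits + [0] * (width - shift - len(x_bits))
            acc = ripple_add(acc, partial)[:width]
    return acc

def to_bits(n, width):                         # little-endian helpers
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

assert from_bits(multiply(to_bits(6, 4), to_bits(7, 4))) == 42
```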

So deep circuits make the process of design easier. But they’re not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous series of papers in the early 1980s*

*The history is somewhat complex, so I won’t give detailed references. See Johan Håstad’s 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references.

showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it’s easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
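The pairwise scheme Nielsen describes is short enough to write down directly; a sketch in Python (XOR of two bits is exactly the parity of a pair):

```python
def deep_parity(bits):
    """Compute parity with a tree of 2-input XOR gates: pair up the bits,
    XOR each pair, and repeat.  The depth grows like log2(n) and only
    about n - 1 two-input gates are used in total."""
    layer = list(bits)
    while len(layer) > 1:
        if len(layer) % 2:                     # pad an odd layer with a 0
            layer.append(0)
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

assert deep_parity([1, 0, 1, 1, 0, 1, 1]) == 1   # five ones: odd parity
```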

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we’ve worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we’d expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we’re doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*

*For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).

How can we train such deep networks? In this chapter, we’ll try training deep networks using our workhorse learning algorithm – stochastic gradient descent by backpropagation. But we’ll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we’ll dig down and try to understand what’s making our deep networks hard to train. When we look closely, we’ll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn’t simply due to bad luck. Rather, we’ll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we’ll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we’ll find that there’s an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.
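One rough way to make "learning at vastly different speeds" concrete is to look at the size of the gradient in each layer. Below is a hedged NumPy sketch (my own code, not the book's network2.py): a single forward and backward pass through a randomly initialized, fully connected sigmoid network with quadratic cost, reporting the norm of \partial C / \partial b^l for every layer as a crude per-layer "speed of learning".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(sizes, x, y, rng):
    """One forward/backward pass through a fully connected sigmoid network
    (quadratic cost) with random weights; return the norm of dC/db for each
    layer, from the first hidden layer to the output layer."""
    weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

    # Forward pass, remembering every weighted input and activation.
    activations, zs, a = [x], [], x
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Backward pass: delta in the output layer, then propagate it back.
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    deltas = [delta]
    for l in range(2, len(sizes)):
        z = zs[-l]
        delta = (weights[-l + 1].T @ delta) * sigmoid(z) * (1 - sigmoid(z))
        deltas.insert(0, delta)

    # dC/db^l is exactly delta^l, so its norm measures that layer's learning speed.
    return [float(np.linalg.norm(d)) for d in deltas]

rng = np.random.default_rng(0)
x = rng.standard_normal((784, 1))          # a made-up MNIST-sized input
y = np.zeros((10, 1)); y[3] = 1.0          # a made-up one-hot target
print(layer_gradient_norms([784, 30, 30, 30, 10], x, y, rng))
```

Running this a few times typically shows the earliest hidden layer with by far the smallest gradient norm, which is precisely the kind of slowdown the rest of the chapter sets out to explain.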

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what’s required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we’ll use deep learning to attack image recognition problems.

───