W!o+'s 《小伶鼬工坊演義》: Neural Networks【backpropagation】Seven

Why is the 'computational speed' of an algorithm worth writing about at such length??

In what sense is backpropagation a fast algorithm?

In what sense is backpropagation a fast algorithm? To answer this question, let’s consider another approach to computing the gradient. Imagine it’s the early days of neural networks research. Maybe it’s the 1950s or 1960s, and you’re the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights C = C(w) alone (we’ll get back to the biases in a moment). You number the weights w_1, w_2, \ldots and want to compute \partial C / \partial w_j for some particular weight w_j. An obvious way of doing that is to use the approximation

\frac{\partial C}{\partial w_j} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon}, \ \ \ \ (46)

where \epsilon > 0 is a small positive number, and e_j is the unit vector in the j^{\rm th} direction. In other words, we can estimate \partial C / \partial w_j by computing the cost C for two slightly different values of w_j, and then applying Equation (46). The same idea will let us compute the partial derivatives \partial C / \partial b with respect to the biases.

This approach looks very promising. It’s simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!
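
Indeed, "a few lines of code" is no exaggeration. Here is a minimal sketch of the finite-difference idea in Equation (46); the helper numerical_gradient and the toy quadratic cost are illustrative assumptions of mine, not code from the text:

import numpy as np

def numerical_gradient(C, w, epsilon=1e-6):
    """Estimate the gradient of a cost C at the point w via Equation (46):
    one extra evaluation of C for every single weight."""
    grad = np.zeros_like(w)
    C_w = C(w)                       # the one evaluation of C(w) itself
    for j in range(len(w)):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0                 # unit vector in the j-th direction
        grad[j] = (C(w + epsilon * e_j) - C_w) / epsilon
    return grad

# Toy stand-in for a network's cost: C(w) = |w|^2 / 2, whose true gradient is w.
C = lambda w: 0.5 * np.dot(w, w)
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(C, w))      # approximately [1, -2, 3]

The loop already betrays the cost: len(w) + 1 evaluations of C in total, which is exactly the "million and one passes" problem described next.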

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight w_j we need to compute C(w+\epsilon e_j) in order to compute \partial C / \partial w_j. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute C(w) as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives \partial C / \partial w_j using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass*

*This should be plausible, but it requires some analysis to make a careful statement. It’s plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it’s multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.

And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it’s actually much, much faster.
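
To see the "one forward pass, one backward pass" claim in code, here is a minimal sketch of backpropagation for a tiny two-layer sigmoid network with a quadratic cost. The layer sizes and variable names are my own illustration, not Nielsen's implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)

# A tiny 3 -> 4 -> 2 network with arbitrary weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))
y = np.array([[1.0], [0.0]])

# One forward pass: dominated by multiplications with W1 and W2.
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# One backward pass: dominated by a multiplication with W2.T.
delta2 = (a2 - y) * sigmoid_prime(z2)        # output error for the quadratic cost
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Every partial derivative at once, with no further passes.
dC_dW2, dC_db2 = delta2 @ a1.T, delta2
dC_dW1, dC_db1 = delta1 @ x.T,  delta1
print(dC_dW1.shape, dC_dW2.shape)            # (4, 3) (2, 4)

The backward sweep reuses the weight matrices (transposed) for roughly the same amount of arithmetic as the forward sweep, which is just the point of the footnote above.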

This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we’ll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.

───

 

Because a 'practical machine' is 'bounded', its 'memory', 'I/O speed', 'CPU speed', ... are all 'finite resources'!? Compare that with the 'theoretical Turing machine': its 'read/write tape' is 'infinite', and the speed of its 'state transitions' is of no consequence whatsoever!!

So, viewed from an 'infinitely long' \infty time beyond this world, does 2^{1024} years even count as long??!! Yet measured against a 'human lifespan', one could hardly last even 2^8 years!!?? This, put plainly, is why people must always talk about 'speed'!!! And perhaps it is worth reflecting that in 'nature's creation' the 'growth and learning' of living things has always proceeded 'step by step'. Is that not astonishing???

How can one get an intuition for the 'speed' of the 'backpropagation algorithm'? Having chanced upon an article on colah's blog, I thought it worthwhile to introduce it to readers:

Calculus on Computational Graphs: Backpropagation

Posted on August 31, 2015

Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years.

Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application independent, name is “reverse-mode differentiation.”

Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.

……

Conclusion

Derivatives are cheaper than you think. That’s the main lesson to take away from this post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly rediscover this fact. That’s an important thing to understand in deep learning. It’s also a really useful thing to know in other fields, and only more so if it isn’t common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. The classic example of this is the problem of vanishing gradients in recurrent neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques. Backpropagation and forward-mode differentiation use a powerful pair of tricks (linearization and dynamic programming) to compute derivatives more efficiently than one might think possible. If you really understand these techniques, you can use them to efficiently calculate several other interesting expressions involving derivatives. We’ll explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend reading Michael Nielsen’s chapter on it for an excellent discussion, more concretely focused on neural networks.

───
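
As a hedged illustration of the "reverse-mode differentiation" that colah's post names, here is a tiny scalar computational graph of my own (not an example taken from the post). A single backward sweep yields the sensitivity of the output to every input:

# Tiny scalar computational graph: y = (a + b) * (b * c)
a, b, c = 2.0, 3.0, 4.0
u = a + b                          # forward pass
v = b * c
y = u * v                          # y = 60

# Reverse-mode sweep: start from dy/dy = 1 and push derivatives backwards.
dy_dy = 1.0
dy_du = dy_dy * v                  # because y = u * v
dy_dv = dy_dy * u
dy_da = dy_du * 1.0                # because u = a + b
dy_db = dy_du * 1.0 + dy_dv * c    # b feeds both u and v
dy_dc = dy_dv * b
print(dy_da, dy_db, dy_dc)         # 12.0 32.0 15.0

Forward-mode differentiation would need one such sweep per input; reverse mode needs one sweep per output, and a neural network's cost has a single output but millions of inputs. That asymmetry is the speedup.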

 

Since the author does not know the precise 'asymptotic behaviour' (Big O notation) of that 'backpropagation algorithm'

Big O notation

Big O notation is a mathematical notation used to describe the asymptotic behaviour of a function. More precisely, it uses another (usually simpler) function to describe an asymptotic upper bound on the order of magnitude of a function. In mathematics, it is generally used to characterize the remainder term of a truncated infinite series, especially an asymptotic series; in computer science, it is very useful in analysing the complexity of algorithms.

Big O notation was first introduced by the German number theorist Paul Bachmann in his 1892 work Analytische Zahlentheorie. The notation was popularized in the writings of another German number theorist, Edmund Landau, which is why it is sometimes called a Landau symbol. The capital O stands for "order of ..."; originally it was the capital Greek letter Ο (omicron), and today the capital Latin letter O is used, but it was never the Arabic numeral '0'.

───

f(x) = O(g(x)) , \ \ \ if \ \ \frac{f(x)}{g(x)} \ remains \ bounded \ as \ x \to \infty.
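
A small worked example of my own may help fix the idea:

f(n) = 3n^2 + 5n \log n \ \Rightarrow \ f(n) = O(n^2) , \ \ \ since \ \ \frac{f(n)}{n^2} = 3 + \frac{5 \log n}{n} \to 3 \ \ as \ \ n \to \infty.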

 

there is not much more that can be said here; let us merely take the

Fast Fourier transform

A fast Fourier transform (FFT) algorithm computes the discrete Fourier transform (DFT) of a sequence, or its inverse. Fourier analysis converts a signal from its original domain (often time or space) to a representation in the frequency domain and vice versa. An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors.[1] As a result, it manages to reduce the complexity of computing the DFT from O(n^2), which arises if one simply applies the definition of DFT, to O(n \log n), where n is the data size.

Fast Fourier transforms are widely used for many applications in engineering, science, and mathematics. The basic ideas were popularized in 1965, but some algorithms had been derived as early as 1805.[2] In 1994 Gilbert Strang described the FFT as “the most important numerical algorithm of our lifetime”[3] and it was included in Top 10 Algorithms of 20th Century by the IEEE journal Computing in Science & Engineering.[4]

[Figure: Time signal of a five-term cosine series. Frequencies are multiples of 10 times sqrt(2). Amplitudes are 1, 2, 3, 4, 5. Time step is 0.001 s.]

───

 

as an example to say a little about the importance of 'speed'. Suppose n = {10}^6; then the O(n^2) DFT needs roughly {10}^{12} operations, whereas the O(n \log n) FFT computes the same transform with only about {10}^6 \cdot 6 operations (taking the logarithm to base 10). Truly a world of difference!!??
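
For readers who would rather feel that difference than just count it, here is a small hedged sketch. The naive dft_direct below is my own straight-from-the-definition O(n^2) implementation, used only for contrast; np.fft.fft is the library FFT:

import numpy as np
import time

def dft_direct(x):
    """Naive O(n^2) DFT computed straight from the definition."""
    n = len(x)
    k = np.arange(n)
    M = np.exp(-2j * np.pi * np.outer(k, k) / n)   # the n x n DFT matrix
    return M @ x

n = 2048                     # a modest n; the naive version could never handle 10^6
x = np.random.default_rng(0).normal(size=n)

t0 = time.perf_counter(); X_slow = dft_direct(x); t1 = time.perf_counter()
t2 = time.perf_counter(); X_fast = np.fft.fft(x); t3 = time.perf_counter()

print(np.allclose(X_slow, X_fast))                 # same transform, up to rounding error
print(f"direct DFT: {t1 - t0:.4f} s, FFT: {t3 - t2:.6f} s")

Even at this modest n the gap is plain, and it only widens on the way to n = 10^6: speed is not a footnote, it decides what can be computed at all.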