
W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】八

《風俗通義》‧《窮通》 (Fengsu Tongyi, "Adversity and Success")
The Book of Changes says: "Of the images hung aloft that shine forth, none is greater than the sun and moon." Yet at times they are darkened. The Book of Odes praises: "Broad flow the Jiang and the Han, the guiding threads of the south." Yet at times they are blocked. The Analects says: "Surely Heaven has endowed him without measure; none is greater than the sage." Yet at times he is in straits. The sun and moon do not lose their substance, and so, though veiled, they shine again; the Jiang and Han do not lose their sources, and so, though blocked, they flow through again; the sage does not lose his virtue, and so, though cast aside, he rises again. Nor is it only the sage whom Heaven makes so generous and steadfast; whoever has constancy may attain it too. Therefore the gentleman is not grieved by hardship and want, nor debased by toil and disgrace; he delights in Heaven, knows his destiny, and harbors no resentment. Hence, recording cases in which adversity comes first and gladness after, this chapter is titled "Qiong Tong" (Adversity and Success).

 

At the close of this chapter, Michael Nielsen suddenly turns to "the big picture". One wonders whether the "bright image of the sun and moon" of the epigraph above can indeed be glimpsed here:

Backpropagation: the big picture

As I’ve explained it, backpropagation presents two mysteries. First, what’s the algorithm really doing? We’ve developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It’s one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn’t mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I’ll address both these mysteries.

To improve our intuition about what the algorithm is doing, let’s imagine that we’ve made a small change \Delta w^l_{jk} to some weight in the network, w^l_{jk}:

That change in weight will cause a change in the output activation from the corresponding neuron:

That, in turn, will cause a change in all the activations in the next layer:

Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function:

The change \Delta C in the cost is related to the change \Delta w^l_{jk} in the weight by the equation

\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (47)

This suggests that a possible approach to computing \frac{\partial C}{\partial w^l_{jk}} is to carefully track how a small change in w^l_{jk} propagates to cause a small change in C. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute \partial C / \partial w^l_{jk}.

Let’s try to carry this out. The change \Delta w^l_{jk} causes a small change \Delta a^l_{j} in the activation of the j^{\rm th} neuron in the l^{\rm th} layer. This change is given by

\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (48)

The change in activation \Delta a^l_{j} will cause changes in all the activations in the next layer, i.e., the (l+1)^{\rm th} layer. We’ll concentrate on the way just a single one of those activations is affected, say a^{l+1}_q,

In fact, it’ll cause the following change:

\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j. \ \ \ \ (49)

Substituting in the expression from Equation (48), we get:

\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \ \ \ \ (50)

Of course, the change \Delta a^{l+1}_q will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from w^l_{jk} to C, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m then the resulting expression is

\Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \ \ \ \ (51)

that is, we’ve picked up a \partial a / \partial a type term for each additional neuron we’ve passed through, as well as the \partial C/\partial a^L_m term at the end. This represents the change in C due to changes in the activations along this particular path through the network. Of course, there’s many paths by which a change in w^l_{jk} can propagate to affect the cost, and we’ve been considering just a single path. To compute the total change in C it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,

\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \ \ \ \ (52)

where we’ve summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that

\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \ \ \ \ (53)

Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We’re computing the rate of change of C with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron’s activation with respect to the other neuron’s activation. The edge from the first weight to the first neuron has a rate factor \partial a^{l}_j / \partial w^l_{jk}. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change \partial C / \partial w^l_{jk} is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:

What I’ve been providing up to now is a heuristic argument, a way of thinking about what’s going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53). That’s easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
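To make this sum-over-paths picture concrete, here is a minimal numerical sketch (my own illustration, not Nielsen's code; the tiny 1–1–2–1 layer sizes, the quadratic cost and all variable names are assumptions chosen so that the paths are easy to enumerate). It evaluates Equation (53) by explicitly summing over the two paths through the third layer and compares the result with a finite-difference estimate of \partial C / \partial w^2_{11}.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

np.random.seed(0)
w2, b2 = np.random.randn(), np.random.randn()   # weight w^2_{11} and bias of the single layer-2 neuron
w3, b3 = np.random.randn(2), np.random.randn(2) # layer 3 has two neurons q = 1, 2
w4, b4 = np.random.randn(2), np.random.randn()  # single output neuron
x, y = 0.7, 0.3                                 # one input and its target; C = (a^4_1 - y)^2 / 2

def forward(w2_val):
    z2 = w2_val * x + b2
    a2 = sigmoid(z2)
    z3 = w3 * a2 + b3
    a3 = sigmoid(z3)
    z4 = np.dot(w4, a3) + b4
    a4 = sigmoid(z4)
    return z2, z3, z4, a2, a3, a4

z2, z3, z4, a2, a3, a4 = forward(w2)

# Equation (53): one product of rate factors per path a^2_1 -> a^3_q -> a^4_1, summed over q.
dC_da4  = a4 - y                      # \partial C / \partial a^4_1
da4_da3 = w4 * sigmoid_prime(z4)      # \partial a^4_1 / \partial a^3_q, for q = 1, 2
da3_da2 = w3 * sigmoid_prime(z3)      # \partial a^3_q / \partial a^2_1
da2_dw2 = x * sigmoid_prime(z2)       # \partial a^2_1 / \partial w^2_{11}
path_sum = np.sum(dC_da4 * da4_da3 * da3_da2) * da2_dw2

# Finite-difference estimate of the same derivative, as in Equation (47).
eps = 1e-6
C_plus  = 0.5 * (forward(w2 + eps)[-1] - y) ** 2
C_minus = 0.5 * (forward(w2 - eps)[-1] - y) ** 2
print(path_sum, (C_plus - C_minus) / (2 * eps))   # the two values agree to ~6 decimals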

Now, I’m not going to work through all this here. It’s messy and requires considerable care to work through all the details. If you’re up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.

What about the other mystery – how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* – short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It’s just a lot of hard work simplifying the proof I’ve sketched in this section.

*There is one clever step required. In Equation (53) the intermediate variables are activations like a_q^{l+1}. The clever idea is to switch to using weighted inputs, like z^{l+1}_q, as the intermediate variables. If you don’t have this idea, and instead continue using the activations a^{l+1}_q, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter.

───

 

If someone really does follow the map to find the horse, tracing and examining the relevant functional relations in detail:

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ \ \ \ (29)

C = C({\vec a}^L)

{\vec a}^L = \sigma ({\vec z}^L)

\sigma(z) \equiv \frac{1}{1+e^{-z}}

{\vec z}^{\ l} \ \equiv \  W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l}

{\vec z}^{\ l+1} \ \equiv \  W^{\ l+1} \ {\vec a}^{\ l}  \ + \ {\vec b}^{\ l+1}

……

 

and has already grasped the essence of going "backward":

Next, we’ll prove (BP2), which gives an equation for the error \delta^l in terms of the error in the next layer, \delta^{l+1}. To do this, we want to rewrite \delta^l_j = \partial C / \partial z^l_j in terms of \delta^{l+1}_k = \partial C / \partial z^{l+1}_k. We can do this using the chain rule,

\delta^l_j = \frac{\partial C}{\partial z^l_j} \ \ \ \ (40)
\ \ \ \ \ \ \ = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \ \ \ \ (41)
\ \ \ \ \ \ \ = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \ \ \ \ (42)

……

where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of \delta^{l+1}_k. To evaluate the first term on the last line, note that

z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \ \ \ \ (43)

Differentiating, we obtain

\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \ \ \ \ (44)

Substituting back into (42) we obtain

\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \ \ \ \ (45)

This is just (BP2) written in component form.
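A minimal vectorized sketch of this step (my own, not from the book; the layer sizes and variable names are arbitrary): the sum over k in (45) is exactly the matrix-transpose product of (BP2), which the backprop code quoted later in this post implements.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

np.random.seed(1)
n_l, n_next = 4, 3                        # neurons in layer l and in layer l+1
w_next = np.random.randn(n_next, n_l)     # w^{l+1}
z_l = np.random.randn(n_l, 1)             # weighted inputs z^l
delta_next = np.random.randn(n_next, 1)   # error delta^{l+1}

# (BP2): delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
delta_l = np.dot(w_next.transpose(), delta_next) * sigmoid_prime(z_l)

# Component form (45): delta^l_j = sum_k w^{l+1}_{kj} delta^{l+1}_k sigma'(z^l_j)
delta_l_comp = np.array(
    [[sum(w_next[k, j] * delta_next[k, 0] for k in range(n_next)) * sigmoid_prime(z_l[j, 0])]
     for j in range(n_l)])

print(np.allclose(delta_l, delta_l_comp))   # True: the two forms agree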

───

 

Perhaps it only needs to be supplemented with a sketch of the derivation:

Derivation

Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. Assuming one output neuron,[note 2] the squared error function is:

E = \tfrac{1}{2}(t - y)^2,

where

E is the squared error,
t is the target output for a training sample, and
y is the actual output of the output neuron.

The factor of \textstyle\frac{1}{2} is included to cancel the exponent when differentiating. Later, the expression will be multiplied with an arbitrary learning rate, so that it doesn’t matter if a constant coefficient is introduced now.

For each neuron j, its output o_{j} is defined as

o_{j} = \varphi(\mbox{net}_{j}) = \varphi\left(\sum_{k=1}^{n}w_{kj}o_k\right).

The input \mbox{net}_{j} to a neuron is the weighted sum of outputs o_k of previous neurons. If the neuron is in the first layer after the input layer, the o_k of the input layer are simply the inputs x_k to the network. The number of input units to the neuron is n. The variable w_{ij} denotes the weight between neurons i and j.

The activation function \varphi is in general non-linear and differentiable. A commonly used activation function is the logistic function:

 \varphi(z) = \frac{1}{1+e^{-z}}

which has a nice derivative of:

 \frac {d \varphi}{d z}(z) = \varphi(z)(1-\varphi(z))
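A quick numerical sanity check of this derivative (a small sketch of my own, not part of the quoted article): compare \varphi(z)(1-\varphi(z)) against a centered finite difference of \varphi.

import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (phi(z + h) - phi(z - h)) / (2 * h)   # centered-difference approximation of d(phi)/dz
analytic = phi(z) * (1.0 - phi(z))              # the closed form quoted above

print(np.max(np.abs(numeric - analytic)))       # ~1e-11: the two agree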

Finding the derivative of the error

Calculating the partial derivative of the error with respect to a weight w_{ij} is done using the chain rule twice:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\mathrm{net_j}} \frac{\partial \mathrm{net_j}}{\partial w_{ij}}

In the last factor on the right-hand side of the expression above, only one term in the sum \mathrm{net_j} depends on w_{ij}, so that

\frac{\partial \mathrm{net_j}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n}w_{kj}o_k\right) = o_i.

If the neuron is in the first layer after the input layer, o_i is just x_i.

The derivative of the output of neuron j with respect to its input is simply the partial derivative of the activation function (assuming here that the logistic function is used):

\frac{\partial o_j}{\partial\mathrm{net_j}} = \frac {\partial}{\partial \mathrm{net_j}}\varphi(\mathrm{net_j}) = \varphi(\mathrm{net_j})(1-\varphi(\mathrm{net_j}))

This is the reason why backpropagation requires the activation function to be differentiable.

The first term is straightforward to evaluate if the neuron is in the output layer, because then o_j = y and

\frac{\partial E}{\partial o_j} = \frac{\partial E}{\partial y} = \frac{\partial}{\partial y} \frac{1}{2}(t - y)^2 = y - t

However, if j is in an arbitrary inner layer of the network, finding the derivative E with respect to o_j is less obvious.

Considering E as a function of the inputs of all neurons L = {u, v, \dots, w} receiving input from neuron j,

\frac{\partial E(o_j)}{\partial o_j} = \frac{\partial E(\mathrm{net}_u, \mathrm{net}_v, \dots, \mathrm{net}_w)}{\partial o_j}

and taking the total derivative with respect to o_j, a recursive expression for the derivative is obtained:

\frac{\partial E}{\partial o_j} = \sum_{l \in L} \left(\frac{\partial E}{\partial \mathrm{net}_l}\frac{\partial \mathrm{net}_l}{\partial o_j}\right) = \sum_{l \in L} \left(\frac{\partial E}{\partial o_{l}}\frac{\partial o_{l}}{\partial \mathrm{net}_l}w_{jl}\right)

Therefore, the derivative with respect to o_j can be calculated if all the derivatives with respect to the outputs o_l of the next layer – the one closer to the output neuron – are known.

Putting it all together:

\dfrac{\partial E}{\partial w_{ij}} = \delta_{j} o_{i}

with

\delta_{j} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\mathrm{net_j}} = \begin{cases} (o_{j}-t_{j})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an output neuron,}\\ (\sum_{l\in L} \delta_{l} w_{jl})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an inner neuron.} \end{cases}

To update the weight w_{ij} using gradient descent, one must choose a learning rate, \alpha. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by -1:

 \Delta w_{ij} = - \alpha \frac{\partial E}{\partial w_{ij}} = \begin{cases} - \alpha o_{i} (o_{j}-t_{j})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an output neuron,}\\ - \alpha o_{i} (\sum_{l\in L} \delta_{l} w_{jl})o_{j}(1-o_{j}) & \mbox{if } j \mbox{ is an inner neuron.} \end{cases}

The \textstyle -1 is required in order to update in the direction of a minimum, not a maximum, of the error function.

For a single-layer network, this expression becomes the Delta Rule. To better understand how backpropagation works, here is an example to illustrate it: The Back Propagation Algorithm, page 20.
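For concreteness, here is a small sketch of one such gradient-descent step for a single logistic output neuron (my own illustration; the inputs, target and learning rate are arbitrary assumptions). It applies the output-neuron case of the update rule above and checks that the squared error shrinks.

import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(2)
o_prev = np.random.rand(3)     # outputs o_i of the previous layer (the neuron's inputs)
w = np.random.randn(3)         # weights w_{ij} into the output neuron j
t = 1.0                        # target output
alpha = 0.5                    # learning rate

o_j = phi(np.dot(w, o_prev))                 # current output, o_j = phi(net_j)
error_before = 0.5 * (t - o_j) ** 2

delta_j = (o_j - t) * o_j * (1.0 - o_j)      # output-neuron case of delta_j
w = w - alpha * delta_j * o_prev             # Delta w_{ij} = -alpha * delta_j * o_i

error_after = 0.5 * (t - phi(np.dot(w, o_prev))) ** 2
print(error_before, error_after)             # the error should decrease after the step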

───

 

must they not reach it in the end??!!


W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】七

Why does an algorithm's computational speed deserve to be written about at such length??

In what sense is backpropagation a fast algorithm?

In what sense is backpropagation a fast algorithm? To answer this question, let’s consider another approach to computing the gradient. Imagine it’s the early days of neural networks research. Maybe it’s the 1950s or 1960s, and you’re the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights C = C(w) alone (we’ll get back to the biases in a moment). You number the weights w_1, w_2, \ldots and want to compute \partial C / \partial w_j for some particular weight w_j. An obvious way of doing that is to use the approximation

\frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon}, \ \ \ \ (46)

where \epsilon > 0 is a small positive number, and e_j is the unit vector in the j^{\rm th} direction. In other words, we can estimate \partial C / \partial w_j by computing the cost C for two slightly different values of w_j, and then applying Equation (46). The same idea will let us compute the partial derivatives \partial C / \partial b with respect to the biases.
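As a rough sketch of what that approach looks like in code (my own illustration, not Nielsen's; the toy cost function and all names are placeholders), here is Equation (46) applied to each "weight" of a small vector in turn ── note the one extra cost evaluation per weight:

import numpy as np

def cost(w):
    # stand-in for C(w): any differentiable scalar function of the weight vector
    return np.sum(np.sin(w) ** 2)

def numerical_gradient(cost_fn, w, eps=1e-6):
    grad = np.zeros_like(w)
    c0 = cost_fn(w)                     # C(w), computed once
    for j in range(len(w)):             # one further cost evaluation per weight
        e_j = np.zeros_like(w)
        e_j[j] = eps
        grad[j] = (cost_fn(w + e_j) - c0) / eps   # Equation (46)
    return grad

np.random.seed(4)
w = np.random.randn(5)
print(numerical_gradient(cost, w))
print(np.sin(2.0 * w))                  # exact gradient of sum(sin^2 w), for comparison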

This approach looks very promising. It’s simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight w_j we need to compute C(w+\epsilon e_j) in order to compute \partial C / \partial w_j. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute C(w) as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives \partial C / \partial w_j using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass*

*This should be plausible, but it requires some analysis to make a careful statement. It’s plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it’s multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.

And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it’s actually much, much faster.

This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we’ll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.

───

 

Because a practical machine is bounded, its memory, its I/O speed, its CPU speed and so on are all finite resources!? The theoretical Turing machine, by contrast, has an unbounded read/write tape, and the speed of its state transitions is beside the point!!

So, measured against an unboundedly long \infty, other-worldly span of time, is 2^{1024} years really long??!! Measured against a human lifespan, one could hardly last even 2^8 years!!?? That, put plainly, is why people always end up talking about "speed"!!! And perhaps, reflecting on nature's creations, it is astonishing that the growth and learning of living things has always proceeded step by step???

How, then, can one get an intuition for the "speed" of the backpropagation algorithm? I happened to read an article on colah's blog, found it helpful, and introduce it to readers here:

Calculus on Computational Graphs: Backpropagation

Posted on August 31, 2015

Introduction

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years.

Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application independent, name is “reverse-mode differentiation.”

Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.

……

Conclusion

Derivatives are cheaper than you think. That’s the main lesson to take away from this post. In fact, they’re unintuitively cheap, and us silly humans have had to repeatedly rediscover this fact. That’s an important thing to understand in deep learning. It’s also a really useful thing to know in other fields, and only more so if it isn’t common knowledge.

Are there other lessons? I think there are.

Backpropagation is also a useful lens for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. The classic example of this is the problem of vanishing gradients in recurrent neural networks.

Finally, I claim there is a broad algorithmic lesson to take away from these techniques. Backpropagation and forward-mode differentiation use a powerful pair of tricks (linearization and dynamic programming) to compute derivatives more efficiently than one might think possible. If you really understand these techniques, you can use them to efficiently calculate several other interesting expressions involving derivatives. We’ll explore this in a later blog post.

This post gives a very abstract treatment of backpropagation. I strongly recommend reading Michael Nielsen’s chapter on it for an excellent discussion, more concretely focused on neural networks.

───

 

Since the author does not know the precise Big O notation asymptotic behaviour ── the "large-n behaviour" ── of the backpropagation algorithm,

Big O notation

Big O notation is a mathematical notation used to describe the asymptotic behaviour of a function. More precisely, it uses another (usually simpler) function to describe an asymptotic upper bound on a function's order of magnitude. In mathematics it is generally used to characterize the remainder term of a truncated infinite series, especially of an asymptotic series; in computer science it is very useful for analyzing the complexity of algorithms.

Big O notation was first introduced by the German number theorist Paul Bachmann in his 1892 work Analytische Zahlentheorie. The notation was popularized in the writings of another German number theorist, Edmund Landau, and so it is sometimes called a Landau symbol. The big O stands for "order of …"; it was originally a capital Greek letter omicron (Ο), and nowadays a capital Latin letter O is used, but it was never the Arabic numeral 0.

───

f(x) = O(g(x)) \ \ (x \to \infty) \ \ \ \mbox{if} \ \ \left| \frac{f(x)}{g(x)} \right| \le M \ \mbox{ for some constant } M \mbox{ and all sufficiently large } x

 

he cannot say much more about it here, and will only take the

Fast Fourier transform

A fast Fourier transform (FFT) algorithm computes the discrete Fourier transform (DFT) of a sequence, or its inverse. Fourier analysis converts a signal from its original domain (often time or space) to a representation in the frequency domain and vice versa. An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse (mostly zero) factors.[1] As a result, it manages to reduce the complexity of computing the DFT from O(n^2), which arises if one simply applies the definition of DFT, to O(n \log n), where n is the data size.

Fast Fourier transforms are widely used for many applications in engineering, science, and mathematics. The basic ideas were popularized in 1965, but some algorithms had been derived as early as 1805.[2] In 1994 Gilbert Strang described the FFT as “the most important numerical algorithm of our lifetime”[3] and it was included in Top 10 Algorithms of 20th Century by the IEEE journal Computing in Science & Engineering.[4]

[Figure: time signal of a five-term cosine series. Frequencies are multiples of 10 times sqrt(2); amplitudes are 1, 2, 3, 4, 5; time step 0.001 s.]

───

 

as an example of why "speed" matters. If n = {10}^6, an O(n^2) DFT needs roughly {10}^{12} operations, whereas an O(n \cdot \log n) DFT needs only about {10}^6 \cdot 6 operations (taking the logarithm base 10) ── truly a difference of heaven and earth!!??
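A back-of-the-envelope sketch of that comparison (my own, in Python): it simply evaluates the two operation counts for n = 10^6, taking the logarithm base 10 as the text does.

import math

n = 10 ** 6
naive_dft_ops = n ** 2              # O(n^2): applying the DFT definition directly
fft_ops = n * math.log10(n)         # O(n log n), with the logarithm taken base 10

print(naive_dft_ops)                # 1000000000000, about 10^12
print(fft_ops)                      # 6000000.0, about 6 * 10^6
print(naive_dft_ops / fft_ops)      # a factor of roughly 170,000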


W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】六

As the saying goes: the dragon vein runs a thousand li only to settle right here. This passage by Michael Nielsen

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:

  1. Input x : Set the corresponding activation a^l for the input layer.
  2. Feedforward: For each l = 2, 3, \ldots, L compute z^{l} = w^l a^{l-1}+b^l and a^{l} = \sigma(z^{l}).
  3. Output error \delta^L : Compute the vector \delta^{L} = \nabla_a C \odot \sigma'(z^L).
  4. Backpropagate the error: For each l = L-1, L-2, \ldots, 2 compute \delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l}).
  5. Output: The gradient of the cost function is given by \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j and \frac{\partial C}{\partial b^l_j} = \delta^l_j.

Examining the algorithm you can see why it’s called backpropagation. We compute the error vectors \delta^l backward, starting from the final layer. It may seem peculiar that we’re going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.

……

As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, C = C_x. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples
  2. For each training example  x : Set the corresponding input activation a^{x,1}, and perform the following steps:
    • Feedforward: For each l = 2, 3, \ldots, L compute z^{x,l} = w^l a^{x,l-1}+b^l and a^{x,l} = \sigma(z^{x,l}).
    • Output error \delta^{x,L} : Compute the vector \delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L}).
    • Backpropagate the error: For each l = L-1, L-2, \ldots, 2 compute \delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l}).
  3. Gradient descent: For each l = L, L-1, \ldots, 2 update the weights according to the rule w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T, and the biases according to the rule b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}.

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I’ve omitted those for simplicity.

───

 

perhaps reads with ease and delight! Even the code

The code for backpropagation

Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

 

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses the backprop method to figure out the partial derivatives \partial C_x / \partial b^l_j and \partial C_x / \partial w^l_{jk}. The backprop method follows the algorithm in the last section closely. There is one small change – we use a slightly different approach to indexing the layers. This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., l[-3] is the third last entry in a list l. The code for backprop is below, together with a few helper functions, which are used to compute the \sigma function, the derivative \sigma', and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way. If something’s tripping you up, you may find it helpful to consult the original description (and complete listing) of the code.

class Network(object):
...
   def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

───
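As a hedged usage sketch (assuming the book's full network.py module ── the file these excerpts come from, which targets Python 2.7 ── is importable; the layer sizes below are arbitrary): feed one random training pair through backprop and inspect the shapes of the returned gradients.

import numpy as np
import network                      # the book's network.py, containing the Network class above

net = network.Network([2, 3, 1])    # 2 inputs, one hidden layer of 3 neurons, 1 output
x = np.random.randn(2, 1)           # a single input column vector
y = np.random.randn(1, 1)           # its target output

nabla_b, nabla_w = net.backprop(x, y)
print([b.shape for b in nabla_b])   # [(3, 1), (1, 1)] - same shapes as net.biases
print([w.shape for w in nabla_w])   # [(3, 2), (1, 3)] - same shapes as net.weights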

 

also seems clear at a glance!! At this point, though, one should mind where the thread leads. As discussed before in the post

《W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】六》,

this stochastic gradient descent method rests on assumptions: the learning rate \eta and the mini-batch size m are both hyperparameters, and choosing them well still calls for careful thought and discernment??

At this moment, reading a piece that seems to drop in "out of the blue" □○ may perhaps prove useful??!!

In the essay 《λ 運算︰淵源介紹》 we noted that in the function y = f(x) = x^2 the variable x is a dummy name: it is an input value from the function's domain, while y is the output value in its codomain. The function itself is fixed by the computation rule \Box^2, so calling it f(x) or g(w), z = g(w) = w^2, refers to the same function; as an abstract definition, a function is "anonymous". "Naming" is merely a convenience for referring to, or distinguishing between, different functions. When an element h of the domain of a function F is itself a function, confusion arises easily: do we mean a composite function H = F \circ h, \ H(x) = F(h(x)), whose domain is the variable x, or a "functional", whose domain is the function h(x) itself? For instance H(t) = \int_{0}^{t} h(x) dx; perhaps it is best to give that kind of "functional" its own name. With that distinction in hand, when an equation containing an unknown or implicit function inside a composition is called a "functional equation", the two senses can be kept clearly apart!!

So what is the importance of functional equations? Let us begin with an example:

f(x + y) = f(x) \cdot f(y)

That is, some unknown function f turns the addition of two numbers in its domain, x + y, into the multiplication of two numbers in its codomain, f(x) \cdot f(y). The global behaviour of this unknown function is determined by the equation itself. For instance, f(0) = 1. Why? Because f(0) = f(0 + 0) = f(0) \cdot f(0) = {f(0)}^2, so f(0) = 0 or f(0) = 1. And why not take f(0) = 0? Because if f(0) = 0, then f(x) = f(x + 0) = f(x) \cdot f(0) = 0 is just the zero function ── what is usually called the trivial solution. For completeness it cannot be omitted, yet it is "too obvious" ── 0 + 0 = 0 \cdot 0 ── and seems to need no further comment. Still, this habit of argument sometimes makes things hard for beginners, which is why it is pointed out here. How does one solve a functional equation? In general it is very difficult; sometimes one cannot even tell whether it has one solution, several solutions, infinitely many, or none at all!! In recent years some standard methods have gradually emerged ── for example the divide-and-conquer successive approximations used in dynamic programming, or transforming the original equation into a form suited to fixed-point iteration, and so on. Here we introduce an application of infinitesimal analysis and say a little about how one may think about such an equation for a smooth function f(x). Since f(x) is smooth it is differentiable; the derivative f^{\prime}(x) exists and is continuous, so f(x + \delta x) can be written as

f(x + \delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x

, where \delta x \approx 0 \Longrightarrow \epsilon \approx 0. Since f(x + y) = f(x) \cdot f(y), we have f(x + \delta x) = f(x) \cdot f(\delta x), and hence f(x + \delta x) = f(x) \cdot f(\delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x, so that

\frac{f^{\prime}(x)}{f(x)} = \frac{f(\delta x) - 1}{\delta x} - \epsilon

. Because f(x) is smooth, \frac{f^{\prime}(x)}{f(x)} exists and is continuous, while \frac{f(\delta x) - 1}{\delta x} - \epsilon does not depend on the variable x; so as \delta x \approx 0 it must be some constant, call it k. The functional equation is thereby rewritten as the differential equation

f^{\prime}(x) = k \cdot f(x), \ f(0) = 1

. Its solution is f(x) = e^{k \cdot x}, \ k \neq 0. Why k \neq 0? Because then f^{\prime}(x) = 0 and f(x) would be a constant function ── exactly the trivial solution mentioned above!!

At first sight this method of solution seems truly marvellous: the problem has turned into a differential equation! Looking more closely, f(x + y) = f(x) \cdot f(y) expresses a global property of the function f(x), whereas f(x + \delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x describes only its infinitesimal neighbourhood. Is it really legitimate to combine the two? If a function's global property naturally entails \Longrightarrow a description of its infinitesimal neighbourhood, and the function is smooth, can the differential equation of that infinitesimal neighbourhood not be integrated \Longrightarrow back into the global property?

─── excerpted from 《【Sonic π】電聲學之電路學《四》之《一》》
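A small numerical sketch of the conclusion (my own, with an arbitrary constant k): f(x) = e^{k x} does satisfy both the functional equation f(x+y) = f(x) \cdot f(y) and the differential equation f'(x) = k f(x), with f(0) = 1.

import numpy as np

k = 0.7                                    # any nonzero constant

def f(x):
    return np.exp(k * x)

np.random.seed(5)
x = np.random.randn(5)
y = np.random.randn(5)
print(np.allclose(f(x + y), f(x) * f(y)))  # the functional equation holds

h = 1e-6
f_prime = (f(x + h) - f(x - h)) / (2 * h)  # numerical derivative
print(np.allclose(f_prime, k * f(x)))      # f'(x) = k f(x)
print(f(0.0))                              # 1.0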


W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】五

Can one learn something from historical figures? The Wikipedia entry on Gauss, for instance, records the following:

Johann Karl Friedrich Gauß (German: Johann Karl Friedrich Gauß; English: Gauss; Latin: Carolus Fridericus Gauss; 30 April 1777 – 23 February 1855) was a celebrated German mathematician, physicist, astronomer and geodesist. Born in Braunschweig, he died in Göttingen. Gauss is regarded as one of the most important mathematicians in history[1] and is honoured as the "Prince of Mathematics"[2].

In 1792, at the age of 15, Gauss entered the Collegium Carolinum, today's Braunschweig University of Technology. There he began to study higher mathematics, independently discovering the general form of the binomial theorem, the law of quadratic reciprocity in number theory, the prime number theorem, and the arithmetic–geometric mean.[3]

In 1795 Gauss entered the University of Göttingen. In 1796, at 19, he obtained a result of great importance in the history of mathematics: the theory and method of constructing the regular seventeen-sided polygon with straightedge and compass.

In the early morning of 23 February 1855, the 77-year-old Gauss died in his sleep in a reclining chair at the Göttingen observatory.[4]

[Figure: portrait of Gauss from an astronomical journal published in 1838.]

Life

Gauss was the son of an ordinary couple. His mother, the daughter of a poor stonemason, was very intelligent but had received no schooling and was close to illiterate; before becoming the second wife of Gauss's father she worked as a maid. His father had been a gardener, a foreman, a merchant's assistant and an appraiser for a small insurance company, and was conscientious in his work. The story that the three-year-old Gauss corrected his father's debt accounts has come down as an anecdote to this day. Gauss said that he could carry out complicated calculations in his head.

At the age of ten Gauss needed only a short time to answer the problem his schoolmaster had set: sum the natural numbers from 1 to 100. His method was to pair the numbers into 50 pairs each summing to 101 (1+100, 2+99, 3+98, …), giving the result 5050.

───

 

This mathematical genius of his age had extraordinary insight and could treat computational problems from an unusual point of view!! The teacher's challenge of legend ── to compute

1+2+3+ \cdots + 99 +100

── was brought to a swift close by an ingenious trick:

Sum = 1+2+3+ \cdots + 99 +100

Sum = 100+99+ \cdots +3+2 +1

2 \cdot Sum = (1+100) + (2+99) + \cdots + (99+2) + (100+1)

= (101) \times 100

\therefore Sum = \frac{101 \times 100}{2} = 5050

 

But this trick is hard to generalize. For example, to compute

1^2 + 2^2 + 3^2 + \cdots + {99}^2 +{100}^2

it no longer works??!! Is there a more general method?? Suppose we consider the sequence S_n = n^2; then S_{n+1} - S_n = {(n+1)}^2 - n^2 = 2 \cdot n +1, and therefore

S_2 - S_1 = 2 \cdot 1 +1

S_3 - S_2 = 2 \cdot 2 +1

……

S_n - S_{n-1} = 2 \cdot (n-1) + 1

S_{n+1} - S_n = 2 \cdot n +1

\therefore S_{n+1} - S_1 = 2 \cdot (1+2+ \cdots + (n-1) + n) + n

Hence

1+2+ \cdots + (n-1) + n = \frac{S_{n+1} - S_1 - n}{2} = \frac{n \cdot (n+1)} {2}

 

If, following the same pattern, we construct the sequence C_n = n^3, then C_{n+1} - C_n = {(n+1)}^3 - n^3 = 3 \cdot n^2 + 3 \cdot n + 1, and so

C_2 - C_1 = 3 \cdot 1^2 + 3 \cdot 1 +1

C_3 - C_2 = 3 \cdot 2^2 + 3 \cdot 2 +1

……

C_n - C_{n-1} = 3 \cdot {(n-1)}^2 + 3 \cdot (n-1) +1

C_{n+1} - C_n = 3 \cdot n^2 + 3 \cdot n +1

\therefore C_{n+1} - C_1 = 3 \cdot (1^2 + 2^2 + \cdots + n^2) + 3 \cdot (1+2+ \cdots +n) + n

Thus

1^2 + 2^2 + \cdots + n^2 = \frac{C_{n+1} - C_1 - 3 \cdot (1+2+ \cdots +n) - n}{3} = \frac{n \cdot (n+1) \cdot (2 \cdot n +1)}{6}
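A brief sketch checking both closed forms against brute-force sums (my own illustration):

def sum_formula(n):
    return n * (n + 1) // 2                  # 1 + 2 + ... + n

def sum_of_squares_formula(n):
    return n * (n + 1) * (2 * n + 1) // 6    # 1^2 + 2^2 + ... + n^2

for n in (1, 10, 100, 1000):
    assert sum_formula(n) == sum(range(1, n + 1))
    assert sum_of_squares_formula(n) == sum(m * m for m in range(1, n + 1))

print(sum_formula(100))                      # 5050, Gauss's schoolroom sum
print(sum_of_squares_formula(100))           # 338350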

 

So the method really does carry over!!?? Conversely, since 1^2 + 2^2 + \cdots + n^2 must be an integer, n \cdot (n+1) \cdot (2 \cdot n +1) must be a multiple of six ── but how would one prove that directly??? Methods, then, are finite while their uses are endless! Applications can never be exhausted!! In the end, the art of wielding them lives only in one's own mind!!!

The author wrote this preamble to make one point: the backpropagation algorithm is, at bottom, nothing more than a derivation that starts from the definition

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ \ \ \ (29)

and is carried out, with a definite purpose, by means of the chain rule??


 

Once that purpose is clearly understood, the choice of direction for the derivation becomes just as clear ── one has no wish to do superfluous computation! One reasons from the known toward the unknown!!

Between one layer and the next there is the relation

{\vec z}^{\ l} \ \equiv \  W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l}

which can be read "looking forward" or "looking backward"; since the method is named backward propagation, we choose to look backward. Why? Because all we know is the final error

C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2, \ \ \ \ (26)

, and we want to reduce this error by changing the earlier weights W^{\ l} and biases {\vec b}^{\ l}; only by looking backward can we tell by how much to change them!!

That is why, when Michael Nielsen proves equation (BP2), he first makes this purpose explicit:

Next, we’ll prove (BP2), which gives an equation for the error \delta^l in terms of the error in the next layer, \delta^{l+1}. To do this, we want to rewrite \delta^l_j = \partial C / \partial z^l_j in terms of \delta^{l+1}_k = \partial C / \partial z^{l+1}_k. We can do this using the chain rule,

\delta^l_j = \frac{\partial C}{\partial z^l_j} \ \ \ \ (40)
\ \ \ \ \ \ \ = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \ \ \ \ (41)
\ \ \ \ \ \ \ = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \ \ \ \ (42)
───

 

── truly the stroke that dots the dragon's eyes and brings the whole argument to life.

Since the proofs of (BP3) and (BP4) were left as exercises, worked solutions are given here. For ease of reading, the relevant functional relations are listed again below:

C = C({\vec a}^L)

{\vec a}^L = \sigma ({\vec z}^L)

\sigma(z) \equiv \frac{1}{1+e^{-z}}

{\vec z}^{\ l} \ \equiv \  W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l}

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ \ \ \ (29)

 

【BP3】

\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}  \cdot \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1 = \delta^l_j

 

【BP4】

\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \cdot a^{l-1}_k = a^{l-1}_k \cdot \delta^l_j
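A small numpy sketch checking these two worked solutions numerically on a single layer (my own illustration; the sizes, the quadratic cost and the variable names are assumptions): perturb each bias and each weight and compare the finite-difference derivatives with \delta^l_j and a^{l-1}_k \delta^l_j.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

np.random.seed(3)
a_prev = np.random.randn(4, 1)   # a^{l-1}
W = np.random.randn(3, 4)        # W^l
b = np.random.randn(3, 1)        # b^l
y = np.random.randn(3, 1)        # treat layer l as the output layer, C = ||a^l - y||^2 / 2

def cost(W_, b_):
    a = sigmoid(np.dot(W_, a_prev) + b_)
    return 0.5 * np.sum((a - y) ** 2)

z = np.dot(W, a_prev) + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)   # delta^l = dC/dz^l for this quadratic cost

eps = 1e-6
# (BP3): dC/db^l_j should equal delta^l_j
db_num = np.zeros_like(b)
for j in range(b.shape[0]):
    bp, bm = b.copy(), b.copy()
    bp[j, 0] += eps
    bm[j, 0] -= eps
    db_num[j, 0] = (cost(W, bp) - cost(W, bm)) / (2 * eps)
print(np.allclose(db_num, delta))             # True

# (BP4): dC/dw^l_{jk} should equal a^{l-1}_k delta^l_j, i.e. delta a_prev^T in matrix form
dW_num = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        dW_num[j, k] = (cost(Wp, b) - cost(Wm, b)) / (2 * eps)
print(np.allclose(dW_num, np.dot(delta, a_prev.transpose())))   # True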


W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】四

This passage of Michael Nielsen's is the heart of the chapter. He begins with

The Hadamard product, s \odot t

The backpropagation algorithm is based on common linear algebraic operations – things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose s and t are two vectors of the same dimension. Then we use s \odot t to denote the elementwise product of the two vectors. Thus the components of s \odot t are just (s \odot t)_j = s_j t_j. As an example,

\left[\begin{array}{c} 1 \\ 2 \end{array}\right] \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right] = \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right] = \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].  \ \ \ \  (28)

This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We’ll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
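In numpy, which the book's code uses throughout, the Hadamard product of same-shaped arrays is just the ordinary * operator; a tiny sketch reproducing Equation (28):

import numpy as np

s = np.array([[1], [2]])
t = np.array([[3], [4]])
print(s * t)             # [[3] [8]]  -- elementwise: the Hadamard product s ⊙ t
print(np.dot(s.T, t))    # [[11]]     -- contrast: the usual matrix (inner) product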

───

 

── starting there in the hope that this piece of notation will make the content of the backpropagation algorithm easier for readers to follow. This "elementwise multiplication", known as the Hadamard product,

Hadamard product (matrices)

In mathematics, the Hadamard product (also known as the Schur product[1] or the entrywise product[2]) is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element ij is the product of elements ij of the original two matrices. It should not be confused with the more common matrix product. It is attributed to, and named after, either French mathematician Jacques Hadamard, or German mathematician Issai Schur.

The Hadamard product is associative and distributive, and unlike the matrix product it is also commutative.


The Hadamard product operates on identically-shaped matrices and produces a third matrix of the same dimensions.

Definition

For two matrices, A, B, of the same dimension, m \times n the Hadamard product, A \circ B, is a matrix, of the same dimension as the operands, with elements given by

(A \circ B)_{i,j} = (A)_{i,j} \cdot (B)_{i,j}.

For matrices of different dimensions (m \times n and p \times q, where m \not= p or n \not= q or both) the Hadamard product is undefined.

Example

For example, the Hadamard product for a 3×3 matrix A with a 3×3 matrix B is:

 \left(\begin{array}{ccc} \mathrm{a}_{11} & \mathrm{a}_{12} & \mathrm{a}_{13}\\ \mathrm{a}_{21} & \mathrm{a}_{22} & \mathrm{a}_{23}\\ \mathrm{a}_{31} & \mathrm{a}_{32} & \mathrm{a}_{33} \end{array}\right) \circ \left(\begin{array}{ccc} \mathrm{b}_{11} & \mathrm{b}_{12} & \mathrm{b}_{13}\\ \mathrm{b}_{21} & \mathrm{b}_{22} & \mathrm{b}_{23}\\ \mathrm{b}_{31} & \mathrm{b}_{32} & \mathrm{b}_{33} \end{array}\right) = \left(\begin{array}{ccc} \mathrm{a}_{11}\, \mathrm{b}_{11} & \mathrm{a}_{12}\, \mathrm{b}_{12} & \mathrm{a}_{13}\, \mathrm{b}_{13}\\ \mathrm{a}_{21}\, \mathrm{b}_{21} & \mathrm{a}_{22}\, \mathrm{b}_{22} & \mathrm{a}_{23}\, \mathrm{b}_{23}\\ \mathrm{a}_{31}\, \mathrm{b}_{31} & \mathrm{a}_{32}\, \mathrm{b}_{32} & \mathrm{a}_{33}\, \mathrm{b}_{33} \end{array}\right)

───

 

is clear and easy to grasp ── there is probably no need to gild the lily! The "demon" story that comes next, however:

The four fundamental equations behind backpropagation

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives \partial C / \partial w^l_{jk} and \partial C / \partial b^l_j. But to compute those, we first introduce an intermediate quantity, \delta^l_j, which we call the error in the j^{\rm th} neuron in the l^{\rm th} layer. Backpropagation will give us a procedure to compute the error \delta^l_j, and then will relate \delta^l_j to \partial C / \partial w^l_{jk} and \partial C / \partial b^l_j.

To understand how the error is defined, imagine there is a demon in our neural network:

The demon sits at the j^{\rm th} neuron in layer l. As the input to the neuron comes in, the demon messes with the neuron’s operation. It adds a little change \Delta z^l_j to the neuron’s weighted input, so that instead of outputting \sigma(z^l_j), the neuron instead outputs \sigma(z^l_j+\Delta z^l_j). This change propagates through later layers in the network, finally causing the overall cost to change by an amount \frac{\partial C}{\partial z^l_j} \Delta z^l_j.

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they’re trying to find a \Delta z^l_j which makes the cost smaller. Suppose \frac{\partial C}{\partial z^l_j} has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing \Delta z^l_j to have the opposite sign to \frac{\partial C}{\partial z^l_j}. By contrast, if \frac{\partial C}{\partial z^l_j} is close to zero, then the demon can’t improve the cost much at all by perturbing the weighted input z^l_j. So far as the demon can tell, the neuron is already pretty near optimal*

*This is only the case for small changes \Delta z^l_j, of course. We’ll assume that the demon is constrained to make such small changes.

And so there’s a heuristic sense in which \frac{\partial C}{\partial z^l_j} is a measure of the error in the neuron.

Motivated by this story, we define the error \delta^l_j of neuron j in layer l by

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \ \ \ \ (29)

As per our usual conventions, we use \delta^l to denote the vector of errors associated with layer l. Backpropagation will give us a way of computing \delta^l for every layer, and then relating those errors to the quantities of real interest, \partial C / \partial w^l_{jk} and \partial C / \partial b^l_j.

You might wonder why the demon is changing the weighted input z^l_j. Surely it’d be more natural to imagine the demon changing the output activation a^l_j, with the result that we’d be using \frac{\partial C}{\partial a^l_j} as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we’ll stick with \delta^l_j = \frac{\partial C}{\partial z^l_j} as our measure of error*

*In classification problems like MNIST the term “error” is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our \delta vectors. In practice, you shouldn’t have trouble telling which meaning is intended in any given usage.

───

 

does call for careful weighing before it can be properly understood and put to one's own use. Perhaps because Michael Nielsen cannot assume that his readers know about matrices, vectors, or the Leibniz notation for infinitesimals:

Leibniz’s notation

In calculus, Leibniz’s notation, named in honor of the 17th-century German philosopher and mathematician Gottfried Wilhelm Leibniz, uses the symbols dx and dy to represent infinitely small (or infinitesimal) increments of x and y, respectively, just as Δx and Δy represent finite increments of x and y, respectively.[1]

Consider y as a function of a variable x, or y = f(x). If this is the case, then the derivative of y with respect to x, which later came to be viewed as the limit

\lim_{\Delta x\rightarrow 0}\frac{\Delta y}{\Delta x} = \lim_{\Delta x\rightarrow 0}\frac{f(x + \Delta x)-f(x)}{\Delta x},

was, according to Leibniz, the quotient of an infinitesimal increment of y by an infinitesimal increment of x, or

\frac{dy}{dx}=f'(x),

where the right hand side is Joseph-Louis Lagrange’s notation for the derivative of f at x. From the point of view of modern infinitesimal theory, Δx is an infinitesimal x-increment, Δy is the corresponding y-increment, and the derivative is the standard part of the infinitesimal ratio:

f'(x)={\rm st}\Bigg( \frac{\Delta y}{\Delta x} \Bigg).

Then one sets dx=\Delta x, dy = f'(x) dx\,, so that by definition, f'(x)\, is the ratio of dy by dx.

Similarly, although mathematicians sometimes now view an integral

\int f(x)\,dx

as a limit

\lim_{\Delta x\rightarrow 0}\sum_{i} f(x_i)\,\Delta x,

where Δx is an interval containing x_i, Leibniz viewed it as the sum (the integral sign denoting summation) of infinitely many infinitesimal quantities f(x) dx. From the modern viewpoint, it is more correct to view the integral as the standard part of such an infinite sum.


Gottfried Wilhelm von Leibniz (1646 – 1716), German philosopher, mathematician, and namesake of this widely used mathematical notation in calculus.

───

 

he explains it in this roundabout way. If one looks back at the relevant functional relations:

C = C({\vec a}^L)

{\vec a}^L = \sigma ({\vec z}^L)

\sigma(z) \equiv \frac{1}{1+e^{-z}}

{\vec z}^{\ l} \ \equiv \  W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l}

one may find that the choice of regarding C = C({\vec z}^L) comes quite naturally. And for anyone well versed in the chain rule,

Chain rule

In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions. That is, if f and g are functions, then the chain rule expresses the derivative of their composition fg (the function which maps x to f(g(x)) in terms of the derivatives of f and g and the product of functions as follows:

(f\circ g)'=(f'\circ g)\cdot g'.

This can be written more explicitly in terms of the variable. Let F = fg, or equivalently, F(x) = f(g(x)) for all x. Then one can also write

F'(x) = f'(g(x)) g'(x).

The chain rule may be written, in Leibniz’s notation, in the following way. We consider z to be a function of the variable y, which is itself a function of x (y and z are therefore dependent variables), and so, z becomes a function of x as well:

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}.

In integration, the counterpart to the chain rule is the substitution rule.

……
 

Proof via infinitesimals

If y=f(x) and x=g(t) then choosing infinitesimal \Delta t\not=0 we compute the corresponding \Delta x=g(t+\Delta t)-g(t) and then the corresponding \Delta y=f(x+\Delta x)-f(x), so that

\frac{\Delta y}{\Delta t}=\frac{\Delta y}{\Delta x} \frac{\Delta x}{\Delta t}

and applying the standard part we obtain

\frac{d y}{d t}=\frac{d y}{d x} \frac{dx}{dt}

which is the chain rule.
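A tiny numerical sketch of the rule in its Leibniz form (my own; z = sin(y) and y = x^2 are arbitrary choices): the derivative of the composition, computed directly, matches (dz/dy) \cdot (dy/dx).

import math

def g(x):                 # y = g(x) = x^2
    return x * x

def f(y):                 # z = f(y) = sin(y)
    return math.sin(y)

x = 0.8
h = 1e-6
dz_dx = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # derivative of the composition, numerically
dz_dy = math.cos(g(x))                          # f'(y) evaluated at y = g(x)
dy_dx = 2.0 * x                                 # g'(x)

print(dz_dx)              # ~1.2834
print(dz_dy * dy_dx)      # the same value, via the chain rule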

───

 

the content of that passage will seem self-evident and beyond dispute??!! Since calculus is both a prerequisite and a required subject for reading technical books and papers, a deep understanding and an intuitive command of it are of enormous value. For that reason the author once borrowed Jing Fang's method of the sixteen transformations of "flying up and flying down" from the essay 《觀水》 and wrote

the series of posts 《【Sonic π】電路學之補充《四》無窮小算術‧…》, in the hope that they might be of some small help!!??

Finally, one closing remark: the choice of viewpoint often makes the very same problem easier or harder!!!

[Figure: cover of Concrete Mathematics: A Foundation for Computer Science.]

An ant starts to crawl along a taut rubber rope 1 km long at a speed of 1 cm per second (relative to the rubber it is crawling on). At the same time, the rope starts to stretch uniformly by 1 km per second, so that after 1 second it is 2 km long, after 2 seconds it is 3 km long, etc. Will the ant ever reach the end of the rope?


The little ant

The book Concrete Mathematics mentions another counter-intuitive example ── the ant on a rubber rope. An ant crawls at one centimetre per second along a taut rubber rope one kilometre long (the ant's speed being measured relative to the rubber it crawls on). While the ant crawls, the rope stretches uniformly at one kilometre per second: after one second it is two kilometres long, after two seconds three kilometres, and so on. Does the ant really ever reach the other end of the rope?

Suppose that at time t the ant is at position x(t), with x_0 = x(0), crawling toward the far end at speed \alpha while the rope stretches uniformly in the same direction at speed v. A point X on the rope then moves outward at speed \frac{X}{c + v t} v, where c is the rope's original length; since the near end is held fixed, the speed of each point of the rope is proportional to the speed of the far end, v = \frac{c + v t}{c + v t} v. For a stationary observer the ant's velocity is therefore

\frac{dx(t)}{dt} = x'(t) = \alpha+\frac{x(t)}{c+vt} v

. If the rope did not stretch with time, the ant would certainly reach the far end. What, though, would an observer at rest relative to the rope see? He might describe the ant's progress by the fraction \Phi (t) = \frac{x(t)}{c + v t}: the ant starts at t=0 with \Phi(0) = 0 and reaches the far end at the time t=T for which \Phi(T)=1. Since

\frac{d\Phi}{dt} = \frac{d}{dt} \left[ {(c + v t)}^{-1} \cdot x(t) \right]
= {(c + v t)}^{-1} \cdot \frac{dx(t)}{dt} - {(c + v t)}^{-2} \cdot v \, x(t)
= {(c + v t)}^{-1} \cdot \left[ \alpha + {(c + v t)}^{-1} \cdot v \, x(t) \right] - {(c + v t)}^{-2} \cdot v \, x(t)
= \frac{\alpha}{c + v t},

we therefore obtain

\Phi(t)=\int{\frac{\alpha}{c+vt}\,dt}=\frac{\alpha}{v}\log(c+vt)+\kappa

where \kappa is the constant of integration. From \Phi(0)=0 we get \kappa=-\frac{\alpha}{v}\log{c}, so \Phi(t)=\frac{\alpha}{v}\log{\left(\frac{c+vt}{c}\right)}. Then from \Phi(T)=1 we obtain T=\frac{c}{v}\left(e^{v/\alpha}-1\right).

Substituting c = 1 km, v = 1 km/sec, \alpha = 1 cm/sec gives T=(e^{100,000}-1) \ \mathrm{s}\ \approx 2.8\times10^{43,429}\,\mathrm{s}. The little ant really does reach the far end, though it will have to pass through untold myriads of great kalpas to get there!!
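Since e^{100{,}000} overflows ordinary floating point, here is a small sketch (my own) that checks the quoted magnitude through base-10 logarithms instead:

import math

c = 1.0e5        # original rope length in cm (1 km)
v = 1.0e5        # stretching speed in cm/s (1 km/s)
alpha = 1.0      # crawling speed in cm/s

# T = (c/v) * (e^{v/alpha} - 1) is about e^{100000} seconds; work with log10 to avoid overflow
log10_T = (v / alpha) * math.log10(math.e) + math.log10(c / v)
print(log10_T)                    # ~43429.45
print(10 ** (log10_T % 1.0))      # ~2.8, so T is about 2.8 x 10^{43429} s, as quoted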

So why is the stretching speed of a point X equal to \frac{X}{c + v t} v? Consider a ruler with marks at p_i, i = 0 \cdots n, one end held fixed and the other pulled outward, so that each mark also shifts outward by {\delta}_i. The fixed end has {\delta}_0 = 0, the first mark sits at p_1 + {\delta}_1, the second at p_2 + {\delta}_1+ {\delta}_2, and the k-th mark at p_k + \sum \limits_{j=1}^{j=k} {\delta}_j. If the rubber rope's elasticity is uniform, all the {\delta}_i = \delta x are equal, so p_k + \sum \limits_{j=1}^{j=k} {\delta}_j = p_k + k \delta x. When the far end moves at constant speed v we have n \delta x = v \delta t, and so the k-th mark moves at speed \frac{k}{n} v!

People have imagined the following scenario: if everything in the universe suddenly grew or shrank overnight, could we detect it? From the standpoint of measurement, if the measuring rod l_{\lambda} = \lambda l_0 changes with time and the size S_0 of everything being measured changes in step with it, S(t) = \lambda \cdot S_0, then in that universe \frac{S(t)}{l_{\lambda} } = \frac{S_0}{l_0}; measurement simply has no way of knowing whether anything happened at all!! This, in turn, raises the further question of whether the laws of nature would differ under a change of the units of measurement??

── excerpted from 《【Sonic π】電路學之補充《四》無窮小算術‧中下中‧中》

