W!o+'s 《小伶鼬工坊演義》: Neural Networks 【Sigmoid】 Six

Here Michael Nielsen brings up how a method that is feasible in 『theory』 may turn out not to work in 『practice』!!??

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly. To see why it’s costly, suppose we want to compute all the second partial derivatives \partial^2 C/ \partial v_j \partial v_k. If there are a million such v_j variables then we’d need to compute something like a trillion (i.e., a million squared) second partial derivatives*

*Actually, more like half a trillion, since \partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j. Still, you get the point!

That’s going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we’ll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights w_k and biases b_l which minimize the cost in Equation (6). To see how this works, let’s restate the gradient descent update rule, with the weights and biases replacing the variables v_j. In other words, our “position” now has components w_k and b_l, and the gradient vector \nabla C has corresponding components \partial C / \partial w_k and \partial C / \partial b_l. Writing out the gradient descent update rule in terms of components, we have

w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \ \ \ \ \ (16)
b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l} \ \ \ \ \ (17).


By repeatedly applying this update rule we can “roll down the hill”, and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
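To see the rule in action, here is a minimal numerical sketch (my own illustration, not code from the book; the toy cost C(v) = \|v\|^2 / 2 and the name grad_C are assumptions) that repeatedly applies the updates of Equations (16) and (17) to a single parameter vector:

import numpy as np

def grad_C(v):
    # Gradient of the toy cost C(v) = ||v||^2 / 2, namely dC/dv_k = v_k.
    return v

eta = 0.1                     # learning rate
v = np.array([3.0, -4.0])     # the "position": stands in for the w_k and b_l
for step in range(100):
    v = v - eta * grad_C(v)   # the update rule of Equations (16) and (17)

print(v)                      # rolls down toward the minimum at the origin

After 100 steps each component has shrunk by a factor of 0.9^{100} \approx 3 \times 10^{-5}, illustrating the steady “roll down the hill”.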

There are a number of challenges in applying the gradient descent rule. We’ll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let’s look back at the quadratic cost in Equation (6). Notice that this cost function has the form C = \frac{1}{n} \sum_x C_x, that is, it’s an average over costs C_x \equiv \frac{\|y(x)-a\|^2}{2} for individual training examples. In practice, to compute the gradient \nabla C we need to compute the gradients \nabla C_x separately for each training input, x, and then average them, \nabla C = \frac{1}{n} \sum_x \nabla C_x . Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient \nabla C by computing \nabla C_x for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient \nabla C, and this helps speed up gradient descent, and thus learning.
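Spelled out, if X_1, X_2, \ldots, X_m denote the m randomly chosen training inputs of such a sample (the mini-batch; this restates, in the book’s spirit, the estimate it develops next), the approximation is

\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j},

with the quality of the estimate improving as the sample size m grows.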

………


What is 『stochastic』? The Wikipedia entry puts it this way:

Stochastic

The term stochastic occurs in a wide variety of professional or academic fields to describe events or systems that are unpredictable due to the influence of a random variable. The word “stochastic” comes from the Greek word στόχος (stokhos, “aim”).

Researchers refer to physical systems in which they are uncertain about the values of parameters, measurements, expected input and disturbances as “stochastic systems”. In probability theory, a purely stochastic system is one whose state is randomly determined, having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely. In this regard, it can be classified as non-deterministic (i.e., “random”) so that the subsequent state of the system is determined probabilistically. Any system or process that must be analyzed using probability theory is stochastic at least in part.[1][2] Stochastic systems and processes play a fundamental role in mathematical models of phenomena in many fields of science, engineering, finance and economics.

───


One may try to understand it a step further:

Stochastic gradient descent

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

Iterative method

In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

w := w - \eta \nabla Q_i(w).

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.

In pseudocode, stochastic gradient descent can be presented as follows:

  • Choose an initial vector of parameters w and learning rate \eta.
  • Repeat until an approximate minimum is obtained:
    • Randomly shuffle examples in the training set.
    • For i = 1, 2, \ldots, n, do:
      • w := w - \eta \nabla Q_i(w).
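The pseudocode translates almost line for line into Python. The sketch below is my own illustration, assuming a least-squares per-example objective Q_i(w) = (x_i \cdot w - y_i)^2 / 2 on synthetic data; it shuffles and sweeps the training set exactly as described above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # training inputs x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # targets y_i

def grad_Q_i(w, i):
    # Gradient of the per-example objective Q_i(w) = (x_i . w - y_i)^2 / 2.
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(3)                      # initial parameter vector
eta = 0.05                           # learning rate
for epoch in range(20):              # several passes over the training set
    order = rng.permutation(len(X))  # randomly shuffle examples each pass
    for i in order:                  # for i = 1, 2, ..., n, do:
        w = w - eta * grad_Q_i(w, i) # w := w - eta * grad Q_i(w)

print(w)                             # approaches w_true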

A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a “mini-batch”) at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.
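Continuing the same toy least-squares setup as the sketch above (the data, objective, and names are again my own assumptions), the mini-batch compromise replaces the inner per-example loop with one vectorized gradient per batch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # same synthetic data as above
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
eta = 0.05

batch_size = 10
w = np.zeros(3)
for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        # Average of grad Q_i(w) over the mini-batch, computed as one
        # vectorized expression instead of example by example.
        grad = (X[batch] @ w - y[batch]) @ X[batch] / len(batch)
        w = w - eta * grad

print(w)                             # again approaches w_true

Each update now costs one matrix-vector product over ten examples, which vectorization libraries execute far faster than ten separate single-example updates.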

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates \eta decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.[3] [4] This is in fact a consequence of the Robbins-Siegmund theorem.[5]

[Figure: Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]

───


Or perhaps, through 『statistical mechanics』, one can surmise where its 『reasonableness』 is rooted??!!

Suppose we ponder a 『problem』 of this kind:

Consider a 『physical system』 composed of a large number of particles, each carrying a 『random variable』 X_i, i=1 \cdots N described by some 『physical process』. Then at time t, the 『large-number average』 of this 『random variable』,

\frac{1}{N} \sum \limits_{i=1}^{N} P[X = X_i] \cdot X_i

is the 『instantaneous picture』 that the large number of particles presents of this 『physical system』, what 『statistical mechanics』 calls the 『ensemble average』. Viewed instead from the 『random motion』 of one 『typical particle』, its 『random variable』 X_{this}(t_i), i = 1 \cdots N takes values randomly at different moments, so we likewise obtain this 『typical particle』's 『time average』 of the 『random variable』,

\frac{1}{N} \sum \limits_{i=1}^{N} P[t = t_i] \cdot X_{this}(t_i)

which describes this 『typical particle』's 『behaviour through time』 within the 『physical system』. So, are these two averages equal in size??

In the 『Drude model』 we already know that P_{nc}(t) = e^{- t / \tau} is the probability that an 『electron』 suffers no collision within a time interval t. Thus P_{nc}(t) - P_{nc}(t+dt) means the probability that a collision happens between time t and t+dt. Recalling the 『Taylor expansion』 of the exponential function

e^x = \sum \limits_{k=0}^{\infty} \frac{x^k}{k!}

, we have

P_{nc}(t) - P_{nc}(t+dt) = e^{- \frac {t}{ \tau}} - e^{- \frac {t+dt}{ \tau}} = e^{- \frac {t}{ \tau}} \left[ 1 -  e^{- \frac {dt}{ \tau}} \right]

\approx  e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau} , \qquad \text{keeping only the first-order term } e^{- \frac{dt}{\tau}} \approx 1 - \frac{dt}{\tau}

, which, read in reverse, explains why in the 『Drude model』 the probability that a collision occurs is \frac{dt}{\tau}. Thus, in a conductor with N 『free electrons』, about N \cdot e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau} electrons may collide at time t+dt, and the 『ensemble average』 of the 『mean time interval』 between collisions is

\frac{1}{N} \int_{0}^{\infty} t \cdot N \cdot e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau} = \tau

Compared with the electron's 『time average』 in the article 《【Sonic π】電路學之補充《一》》, the two are indeed equal. In fact, whenever a physical system is in what statistical mechanics calls an 『equilibrium state』, both kinds of 『average』 will be 『equal』. Could it truly be that 『consulting the exemplar across history』 and 『surveying the multitude at an instant』 reach the same conclusion??

─── Excerpted from 《【Sonic π】電路學之補充《二》》
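Both claims above can be checked mechanically. The sketch below (my own illustration, not from the quoted article) verifies the integral symbolically with sympy and then compares the two averages numerically, using the fact that in the Drude picture the collision-free intervals are exponentially distributed with mean \tau:

import numpy as np
import sympy as sp

# Symbolic check of the ensemble-average integral:
# (1/N) * integral of t * N * exp(-t/tau) * dt/tau over [0, oo) = tau
t, tau = sp.symbols('t tau', positive=True)
print(sp.integrate(t * sp.exp(-t / tau) / tau, (t, 0, sp.oo)))  # prints: tau

# Numerical comparison for tau = 2: N electrons observed once each
# (ensemble average) versus one electron observed over N successive
# collisions (time average).
rng = np.random.default_rng(0)
TAU, N = 2.0, 100_000
ensemble_avg = rng.exponential(TAU, size=N).mean()
time_avg = rng.exponential(TAU, size=N).mean()
print(ensemble_avg, time_avg)   # both close to TAU = 2.0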