W!o+'s 《小伶鼬工坊演義》: Neural Networks [The Hundred Schools Speak], Part Three

When Qin lost its deer, all under Heaven gave chase. An important algorithm usually gives rise to many variations. Did Michael Nielsen write this passage, perhaps, for those who chase the deer across the Central Plains?

Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of “velocity” for the parameters we’re trying to optimize. The gradient acts to change the velocity, not (directly) the “position”, in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.

Let’s give a more precise mathematical description. We introduce velocity variables v = v_1, v_2, \ldots, one for each corresponding w_j variable*.

*In a neural net the w_j variables would, of course, include all weights and biases.

Then we replace the gradient descent update rule w \rightarrow w' = w - \eta \nabla C by

v \rightarrow v' = \mu v - \eta \nabla C \ \ \ \ (107)
w \rightarrow w' = w + v'. \ \ \ \ (108)

In these equations, \mu is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it’s helpful to first consider the case where \mu = 1, which corresponds to no friction. When that’s the case, inspection of the equations shows that the “force” \nabla C is now modifying the velocity, v, and the velocity is controlling the rate of change of w. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we’re moving straight down a slope:

With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That’s the reason for the \mu hyper-parameter in (107). I said earlier that \mu controls the amount of friction in the system; to be a little more precise, you should think of 1 - \mu as the amount of friction in the system. When \mu = 1, as we’ve seen, there is no friction, and the velocity is completely driven by the gradient \nabla C. By contrast, when \mu = 0 there’s a lot of friction, the velocity can’t build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, w \rightarrow w'=w-\eta \nabla C. In practice, using a value of \mu intermediate between 0 and 1 can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for \mu using the held-out validation data, in much the same way as we select \eta and \lambda.

I’ve avoided naming the hyper-parameter \mu up to now. The reason is that the standard name for \mu is badly chosen: it’s called the momentum coefficient. This is potentially confusing, since \mu is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum coefficient is widely used, so we will continue to use it.

A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it’s done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
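To make the rule concrete, here is a minimal sketch in Python of Equations (107) and (108). The names (momentum_step, grad_C) are ours, chosen purely for illustration; this is not code from the book:

import numpy as np

def momentum_step(w, v, grad_C, eta=0.1, mu=0.9):
    """One momentum update, per Equations (107) and (108).

    w      -- current parameters (numpy array)
    v      -- current velocity, same shape as w
    grad_C -- gradient of the cost evaluated at w
    eta    -- learning rate
    mu     -- momentum coefficient; 1 - mu acts like friction
    """
    v = mu * v - eta * grad_C  # (107): the gradient drives the velocity
    w = w + v                  # (108): the velocity drives the position
    return w, v

# Usage: minimize C(w) = ||w||^2 / 2, whose gradient is just w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = momentum_step(w, v, grad_C=w)
print(w)  # close to the minimum at the origin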

Exercise

  • What would go wrong if we used \mu > 1 in the momentum technique?
  • What would go wrong if we used \mu < 0 in the momentum technique?

 

If \mu > 1, then \mu^k \to \infty as k \to \infty, so the velocity is no longer restrained by the “force” \nabla C and the parameters can run away. If \mu < 0, then \mu^{2k} is positive while \mu^{2k+1} is negative, so the updates are liable to oscillate.
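Both answers are easy to check numerically on the one-dimensional cost C(w) = w^2/2, for which \nabla C = w. The helper below is ours, purely for illustration:

def run_momentum(mu, eta=0.1, steps=200, w0=1.0):
    """Iterate Equations (107)-(108) on C(w) = w**2 / 2, where grad C = w."""
    w, v = w0, 0.0
    ws, vs = [], []
    for _ in range(steps):
        v = mu * v - eta * w
        w = w + v
        ws.append(w)
        vs.append(v)
    return ws, vs

ws, _ = run_momentum(mu=1.1)
print(ws[-1])  # |w| has grown by orders of magnitude: with mu > 1 the velocity feeds on itself

_, vs = run_momentum(mu=-0.5, steps=6)
print(vs)      # the velocity zig-zags from step to step instead of decaying smoothly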

Problem

  • Add momentum-based stochastic gradient descent to network2.py.
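A minimal sketch of one way to approach this problem, assuming the structure of Nielsen's network2.py (a Network class with lists self.weights and self.biases, a backprop method returning per-example gradients, and numpy imported as np). The velocity attributes self.vw and self.vb are our addition and must be initialized to zero arrays of the same shapes, e.g. in the weight initializer; treat this as a starting point, not the book's solution:

def update_mini_batch(self, mini_batch, eta, lmbda, n, mu=0.9):
    """Momentum-based variant of update_mini_batch (a sketch)."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # Equation (107): the gradient (plus the L2 regularization term) updates the velocities...
    self.vw = [mu * vw - eta * (lmbda / n) * w - (eta / len(mini_batch)) * nw
               for vw, w, nw in zip(self.vw, self.weights, nabla_w)]
    self.vb = [mu * vb - (eta / len(mini_batch)) * nb
               for vb, nb in zip(self.vb, nabla_b)]
    # ...and Equation (108): the velocities update the weights and biases.
    self.weights = [w + vw for w, vw in zip(self.weights, self.vw)]
    self.biases = [b + vb for b, vb in zip(self.biases, self.vb)]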

Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn’t universal agreement on which is the best approach. As you go deeper into neural networks it’s worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier*

*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results*

*See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).

is Nesterov’s accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we’ll stick to stochastic gradient descent through the remainder of this book.
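For comparison, Nesterov's update is usually written, in the notation of Equations (107) and (108), as follows; this is our paraphrase of the formulation in Sutskever et al., not a quotation from the text. The only change from plain momentum is that the gradient is evaluated at the looked-ahead point w + \mu v rather than at w:

v \rightarrow v' = \mu v - \eta \nabla C(w + \mu v)
w \rightarrow w' = w + v'.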

───

 

Note, too, the differences in notation and in style of explanation from one author to the next, lest they breed needless confusion!

Stochastic gradient descent

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration.
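One piece of notation worth making explicit (it is standard, though elided in this excerpt): the objective is a finite sum over the n training examples,

Q(w) = \sum_{i=1}^{n} Q_i(w),

where Q_i is the loss on the i-th example, and each SGD step uses the gradient \nabla Q_i of a single randomly chosen term in place of the full gradient \nabla Q. This Q_i is the one that appears in the momentum formulas below.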

……

Momentum

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams’ seminal paper on backpropagation learning.[10] Stochastic gradient descent with momentum remembers the update \Delta w at each iteration, and determines the next update as a linear combination of the gradient and the previous update:[11]

\Delta w := \eta \nabla Q_i(w) + \alpha \Delta w
w := w - \Delta w

or as a mathematically equivalent formulation:[12]

\Delta w := -\eta \nabla Q_i(w) + \alpha \Delta w
w := w + \Delta w

that leads to:

w := w - \eta \nabla Q_i(w) + \alpha \Delta w

where the parameter w which minimizes Q(w) is to be estimated, \eta is a step size (sometimes called the learning rate in machine learning), and \alpha is the momentum coefficient.
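To reconcile this with Nielsen's notation above (the mapping is ours, but the algebra is immediate): identify the \Delta w of the second formulation with the velocity v, \alpha with \mu, and Q_i with the mini-batch cost C; the two update rules then become exactly Equations (107) and (108). This is precisely the kind of notational divergence the commentary above warns about.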

The name momentum stems from an analogy to momentum in physics: the weight vector, thought of as a particle traveling through parameter space,[10] incurs acceleration from the gradient of the loss (“force”). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.[13]

───