“When Qin lost its deer, all under heaven gave chase.” An important algorithm usually comes in many variations. Is Michael Nielsen here writing for those who would chase the deer across the Central Plains??
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of “velocity” for the parameters we’re trying to optimize. The gradient acts to change the velocity, not (directly) the “position”, in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.
Let’s give a more precise mathematical description. We introduce velocity variables v = v₁, v₂, …, one for each corresponding w_j variable*.
*In a neural net the variables w_j would, of course, include all weights and biases.
Then we replace the gradient descent update rule w → w′ = w − η∇C by

v → v′ = μv − η∇C
w → w′ = w + v′
In these equations, μ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it’s helpful to first consider the case where μ = 1, which corresponds to no friction. When that’s the case, inspection of the equations shows that the “force” ∇C is now modifying the velocity, v, and the velocity is controlling the rate of change of w. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we’re moving straight down a slope.
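As a toy sketch of that “building up steam” (my own example, not from the book): feed a constant gradient, i.e. a straight slope, through the velocity update. With friction 0 < μ < 1 the speed saturates at −η∇C/(1 − μ), so μ = 0.9 yields steps ten times larger than plain gradient descent:

```python
# Toy illustration (mine, not from the book): a constant gradient g
# fed through the momentum update v -> v' = mu*v - eta*g.
eta = 0.1          # learning rate
g = 1.0            # constant gradient: a straight downhill slope

def final_speed(mu, steps=1000):
    v = 0.0
    for _ in range(steps):
        v = mu * v - eta * g    # the "force" modifies the velocity
    return v

# mu = 0 recovers plain gradient descent: every step is just -eta*g.
# mu = 0.9 builds up steam: the speed approaches -eta*g/(1 - mu).
print(final_speed(0.0), final_speed(0.9))
```

With μ = 1 (no friction at all) the same loop never saturates: the speed grows by −ηg every round, without bound.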
I’ve avoided naming the hyper-parameter μ up to now. The reason is that the standard name for μ is badly chosen: it’s called the momentum co-efficient. This is potentially confusing, since μ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it’s done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
Exercise
- What would go wrong if we used μ > 1 in the momentum technique?
- What would go wrong if we used μ < 0 in the momentum technique?
If μ > 1, one fears the velocity would no longer be restrained by the “force” ∇C. And if μ < 0, with v positive and ∇C negative, the motion may well oscillate.
Problem
- Add momentum-based stochastic gradient descent to network2.py.
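One possible shape of a solution (a sketch under my own naming, not network2.py’s actual code — network2.py stores the weights in self.weights and gets per-mini-batch gradients from backpropagation): keep one velocity array per weight array, initialized to zero, and apply v → μv − (η/m)∇C, w → w + v.

```python
import numpy as np

# Hedged sketch of momentum-based SGD for a network2.py-style class.
# The class and attribute names are my own; only the update rule
#   v -> mu*v - (eta/m)*grad,  w -> w + v
# is the point.
class MomentumUpdater:
    def __init__(self, weights, mu=0.9):
        self.weights = weights                               # list of arrays
        self.mu = mu                                         # momentum co-efficient
        self.velocities = [np.zeros_like(w) for w in weights]

    def update_mini_batch(self, grads, eta, m):
        """grads: gradients of the cost summed over a mini-batch of size m."""
        self.velocities = [self.mu * v - (eta / m) * g
                           for v, g in zip(self.velocities, grads)]
        self.weights = [w + v for w, v in zip(self.weights, self.velocities)]
```

Setting mu=0 recovers the ordinary update w → w − (η/m)∇C, which makes a handy sanity check when wiring this into existing code.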
Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn’t universal agreement on which is the best approach. As you go deeper into neural networks it’s worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier*
*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).
introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results*
*See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).
is Nesterov’s accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we’ll stick to stochastic gradient descent through the remainder of this book.
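For reference, in Nielsen’s notation (v for velocity, μ for the momentum co-efficient), Nesterov’s method can be sketched as evaluating the gradient at the looked-ahead point w + μv rather than at w — a standard formulation, not taken from this text:

```latex
v \to v' = \mu v - \eta \nabla C(w + \mu v), \qquad w \to w' = w + v'
```

The only change from the momentum update is where ∇C is evaluated, which is why it is described as an improvement on the momentum technique.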
───
Note also the differences in notation and styles of explanation among the various authors, lest they breed needless confusion!!
Stochastic gradient descent
Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration.
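A minimal sketch of that definition (my own toy example, not from the article): the objective is a sum Q(w) = Σᵢ Qᵢ(w), and each iteration steps along the gradient of a single randomly chosen term rather than the full sum.

```python
import random

# Toy SGD (my example): Q(w) = sum_i (w - x_i)^2, minimized at mean(xs).
# Each iteration uses the gradient of one randomly chosen term,
# dQ_i/dw = 2*(w - x_i), instead of the full-sum gradient.
random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]       # data defining the terms Q_i
w = 0.0
eta = 0.05                       # step size (learning rate)
for _ in range(2000):
    x = random.choice(xs)        # stochastically chosen term
    w -= eta * 2 * (w - x)       # gradient step on Q_i alone
# w hovers near the minimizer mean(xs) = 2.5, with noise set by eta
```

With a fixed step size the iterate never settles exactly at the minimum; it fluctuates around it, which is why step-size schedules are often used in practice.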
……
Momentum
Further proposals include the momentum method, which appeared in Rumelhart, Hinton, and Williams’ seminal paper on backpropagation learning.[10] Stochastic gradient descent with momentum remembers the update Δw at each iteration, and determines the next update as a convex combination of the gradient and the previous update:[11]

Δw := α Δw − η ∇Qᵢ(w)
w := w + Δw

or as a mathematically equivalent formulation:[12]

w := w + α Δw − η ∇Qᵢ(w)

that leads to:

w := w − η ∇Qᵢ(w) + α Δw

where the parameter w which minimizes Q(w) is to be estimated, and η is a step size (sometimes called the learning rate in machine learning).
The name momentum stems from an analogy to momentum in physics: the weight vector, thought of as a particle traveling through parameter space,[10] incurs acceleration from the gradient of the loss (“force”). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.[13]
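A small numerical check of that claim (my own toy example, not from the article), using the update Δw := αΔw − η∇Qᵢ(w), w := w + Δw on an elongated quadratic:

```python
# Toy example (mine): Q(w) = 0.5*(10*w1^2 + w2^2), minimum at (0, 0).
# With eta = 0.15, plain gradient descent overshoots along the steep
# w1 axis (that component flips sign every step) yet crawls along the
# shallow w2 axis; momentum keeps accumulating speed along the
# consistent w2 direction, so it ends much closer to the minimum.
def distance_after(alpha, eta=0.15, steps=100):
    w = [1.0, 1.0]
    dw = [0.0, 0.0]
    for _ in range(steps):
        grad = [10.0 * w[0], 1.0 * w[1]]                # gradient of Q
        dw = [alpha * d - eta * g for d, g in zip(dw, grad)]
        w = [wi + di for wi, di in zip(w, dw)]
    return (w[0] ** 2 + w[1] ** 2) ** 0.5               # distance to minimum

plain = distance_after(alpha=0.0)          # classical SGD update
with_momentum = distance_after(alpha=0.5)  # momentum update
```

This mirrors the physical analogy: sign-flipping forces largely cancel in the accumulated update, while a persistent force keeps adding up.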
───