W!o+'s 《小伶鼬工坊演義》: Neural Networks [A Hundred Schools of Thought] II

The affairs of the world follow the drift of the times, and so great minds think alike. Mr. Michael Nielsen writes:

Variations on stochastic gradient descent

Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.

Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we’re just going to consider the abstract problem of minimizing a cost function C which is a function of many variables, w = w_1, w_2, \ldots, so C = C(w). By Taylor’s theorem, the cost function can be approximated near a point w by

C(w+\Delta w) = C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \ \ \ \ (103)

We can rewrite this more compactly as

C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \ \ \ \ (104)

where \nabla C is the usual gradient vector, and H is a matrix known as the Hessian matrix, whose jkth entry is \partial^2 C / \partial w_j \partial w_k. Suppose we approximate C by discarding the higher-order terms represented by \ldots above,

C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \ \ \ \ (105)

Using calculus we can show that the expression on the right-hand side can be minimized* by choosing

\Delta w = -H^{-1} \nabla C. \ \ \ \ (106)

*Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function C looks like a valley locally, not a mountain or a saddle.

Provided (105) is a good approximate expression for the cost function, then we’d expect that moving from the point w to w+\Delta w = w-H^{-1} \nabla C should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost:

  • Choose a starting point, w.
  • Update w to a new point w' = w-H^{-1} \nabla C, where the Hessian H and \nabla C are computed at w.
  • Update w' to a new point w'' = w'-H'^{-1} \nabla' C, where the Hessian H' and \nabla' C are computed at w'.
  • \ldots

In practice, (105) is only an approximation, and it’s better to take smaller steps. We do this by repeatedly changing w by an amount \Delta w = -\eta H^{-1} \nabla C, where \eta is known as the learning rate.
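To make the damped update rule \Delta w = -\eta H^{-1} \nabla C concrete, here is a minimal numpy sketch applied to a toy quadratic cost; the matrix A, vector b, learning rate, and iteration count are invented for illustration and are not taken from the book.

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])        # positive definite, so C has a unique minimum
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b              # \nabla C for the quadratic C(w) = 1/2 w^T A w - b^T w

def hessian(w):
    return A                      # the Hessian is constant for a quadratic cost

eta = 1.0                         # with eta = 1 a quadratic is minimized in a single step
w = np.zeros(2)
for _ in range(5):
    # solve H * delta_w = -grad(w) rather than forming H^{-1} explicitly
    delta_w = np.linalg.solve(hessian(w), -grad(w))
    w = w + eta * delta_w

print(w, np.allclose(A @ w, b))   # w is now the minimizer of C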

This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it’s possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.

If Hessian optimization is so great, why aren’t we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it’s very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with 10^7 weights and biases. Then the corresponding Hessian matrix will contain 10^7 \times 10^7 = 10^{14} entries. That’s a lot of entries! And that makes computing H^{-1} \nabla C extremely difficult in practice. However, that doesn’t mean that it’s not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let’s take a look at one such technique, momentum-based gradient descent.

───

 

And the Stanford CS course notes say:

cs231n.github.io/neural-networks-3

Second order methods

A second, popular group of methods for optimization in the context of deep learning is based on Newton’s method, which iterates the following update:

x \leftarrow x - [H f(x)]^{-1} \nabla f(x)

Here, H f(x) is the Hessian matrix, which is a square matrix of second-order partial derivatives of the function. The term \nabla f(x) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which proponents of these methods cite as a large advantage over first-order methods.

However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of quasi-Newton methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is L-BFGS, which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).
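As a hedged illustration of the quasi-Newton idea, the sketch below calls SciPy’s generic minimizer with method="L-BFGS-B" on an invented, ill-conditioned quadratic; it is not a neural-network loss, and the problem size is deliberately tiny.

import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0, 100.0])     # ill-conditioned toy objective

def f(x):
    return 0.5 * x @ A @ x          # objective value

def grad_f(x):
    return A @ x                    # analytic gradient supplied to the optimizer

x0 = np.ones(3)
result = minimize(f, x0, jac=grad_f, method="L-BFGS-B")
print(result.x, result.nit)         # near-zero minimizer after a handful of iterations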

However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.

In practice, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.

Additional references:

  • Large Scale Distributed Deep Networks is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
  • SFO algorithm strives to combine the advantages of SGD with advantages of L-BFGS.

……

【Reference Pages】

cs231n.github.io

cs231n.github.io/neural-networks-1

cs231n.github.io/neural-networks-2

cs231n.github.io/neural-networks-3

───

 

They read almost identically; surely that is no coincidence!!

One should also know that this Newton’s method has quite a pedigree, should one not??

Newton’s method in optimization

In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function f (i.e. solutions to the equation f(x) = 0). In optimization, Newton’s method is applied to the derivative f ′ of a twice-differentiable function f to find the roots of the derivative (solutions to f ′(x) = 0), also known as the stationary points of f.

Method

In the one-dimensional problem, Newton’s method attempts to construct a sequence x_n from an initial guess x_0 that converges towards some value x* satisfying f ′(x*) = 0. This x* is a stationary point of f.

The second-order Taylor expansion f_T(x) of f around x_n is:

f_T(x_n + \Delta x) \approx f(x_n) + f'(x_n)\,\Delta x + \frac{1}{2} f''(x_n)\,\Delta x^2.

We want to find \Delta x such that x_n + \Delta x is a stationary point of this approximation. Setting the derivative of the last expression with respect to \Delta x equal to zero gives:

f'(x_n) + f''(x_n)\,\Delta x = 0.

For the value \Delta x = -f'(x_n) / f''(x_n), which is the solution of this equation, it can be hoped that x_{n+1} = x_n + \Delta x = x_n - f'(x_n) / f''(x_n) will be closer to a stationary point x*. Provided that f is a twice-differentiable function and other technical conditions are satisfied, the sequence x_1, x_2, \ldots will converge to a point x* satisfying f ′(x*) = 0.
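A minimal Python sketch of this one-dimensional iteration, x_{n+1} = x_n - f'(x_n)/f''(x_n), follows; the example function f(x) = x^4 - 3x^3 + 2, its hand-coded derivatives, the starting point, and the iteration count are all invented for illustration.

def f_prime(x):
    return 4 * x**3 - 9 * x**2      # f'(x) for f(x) = x^4 - 3x^3 + 2

def f_double_prime(x):
    return 12 * x**2 - 18 * x       # f''(x)

x = 6.0                             # initial guess x_0
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)

print(x)                            # approaches the stationary point x* = 9/4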

[Figure: A comparison of gradient descent (green) and Newton’s method (red) for minimizing a function (with small step sizes). Newton’s method uses curvature information to take a more direct route.]

Geometric interpretation

The geometric interpretation of Newton’s method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.

Higher dimensions

The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, \nabla f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H f(x). One obtains the iterative scheme

x_{n+1} = x_n - [H f(x_n)]^{-1} \nabla f(x_n), \qquad n \ge 0.

Often Newton’s method is modified to include a small step size γ ∈ (0,1) instead of γ = 1:

x_{n+1} = x_n - \gamma\,[H f(x_n)]^{-1} \nabla f(x_n).

This is often done to ensure that the Wolfe conditions are satisfied at each step x_n → x_{n+1} of the iteration. For step sizes other than 1, the method is often referred to as the relaxed Newton’s method.

Where applicable, Newton’s method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood N such that, if we start with x_0 ∈ N, Newton’s method with step size γ = 1 converges quadratically (if the Hessian is invertible and a Lipschitz continuous function of x in that neighborhood).

───

Hessian matrix

Use in optimization

Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function. That is,

y = f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^{\mathrm{T}} \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^{\mathrm{T}} \mathbf{H}(\mathbf{x})\, \Delta\mathbf{x}

where \nabla f is the gradient (\partial f/\partial x_1, \ldots, \partial f/\partial x_n). Computing and storing the full Hessian matrix takes \Theta(n^2) memory, which is infeasible for high-dimensional functions such as the loss functions of neural nets, conditional random fields, and other statistical models with large numbers of parameters. For such situations, truncated-Newton and quasi-Newton algorithms have been developed. The latter family of algorithms use approximations to the Hessian; one of the most popular quasi-Newton algorithms is BFGS.[7]

Such approximations may use the fact that an optimization algorithm uses the Hessian only as a linear operator H(v), and proceed by first noticing that the Hessian also appears in the local expansion of the gradient:

\nabla f(\mathbf{x} + \Delta\mathbf{x}) = \nabla f(\mathbf{x}) + \mathbf{H}(\Delta\mathbf{x}) + O(\|\Delta\mathbf{x}\|^2)

Letting Δx = rv for some scalar r, this gives

\mathbf{H}(\Delta\mathbf{x}) = \mathbf{H}(r\mathbf{v}) = r\,\mathbf{H}(\mathbf{v}) = \nabla f(\mathbf{x} + r\mathbf{v}) - \nabla f(\mathbf{x}) + O(r^2),

i.e.,

\mathbf{H}(\mathbf{v}) = \frac{1}{r}\Bigl[\nabla f(\mathbf{x} + r\mathbf{v}) - \nabla f(\mathbf{x})\Bigr] + O(r)

so if the gradient is already computed, the approximate Hessian–vector product H(v) can be computed by a linear (in the size of the gradient) number of scalar operations. (While simple to program, this approximation scheme is not numerically stable, since r has to be made small to prevent error due to the O(r) term, but decreasing it loses precision in the first term.[8])
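As a hedged sketch of this finite-difference trick, the few lines below approximate H(v) ≈ [\nabla f(x + r v) - \nabla f(x)] / r for an invented quadratic whose exact Hessian is known, so the approximation can be checked directly; the matrix A, point x, direction v, and the value of r are all illustrative.

import numpy as np

A = np.array([[2.0, 0.3],
              [0.3, 1.5]])

def grad_f(x):
    return A @ x                          # exact gradient of f(x) = 1/2 x^T A x

def hessian_vector_product(x, v, r=1e-5):
    # finite difference of gradients along the direction v
    return (grad_f(x + r * v) - grad_f(x)) / r

x = np.array([0.7, -1.2])
v = np.array([1.0, 2.0])
print(hessian_vector_product(x, v))       # approximately A @ v
print(A @ v)                              # exact product for comparison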

───