W!o+'s 《小伶鼬工坊演義》: Neural Networks [Turning Point] III

Some ride at full gallop and outpace the rushing wind; others scull a light boat slowly and arrive, in time, with the turning moon and wheeling stars. What sort of person, one wonders, is Mr. Michael Nielsen? Continuing from the problem of "overfitting", he first devotes a long stretch of text to the "practice" of "regularization":

Regularization

Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we’d only adopt reluctantly.

Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here’s the regularized cross-entropy:

C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln (1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2. \ \ \ \ (85)

The first term is just the usual expression for the cross-entropy. But we’ve added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor \lambda / 2n, where \lambda > 0 is known as the regularization parameter, and n is, as usual, the size of our training set. I’ll discuss later how \lambda is chosen. It’s also worth noting that the regularization term doesn’t include the biases. I’ll also come back to that below.
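To make equation (85) concrete, here is a minimal numpy sketch; the function name, and the assumption that the output activations a^L for the n training examples are stacked in an (n, output_size) array, are mine rather than the book's:

import numpy as np

def regularized_cross_entropy(a, y, weights, lmbda):
    # a, y: arrays of shape (n, output_size) holding the output
    # activations a^L and the desired outputs for the n training examples.
    # weights: list of the network's weight matrices; lmbda: \lambda.
    n = y.shape[0]
    c0 = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n
    reg = (lmbda / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return c0 + reg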

Of course, it’s possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:

C = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \ \ \ \ (86)

In both cases we can write the regularized cost function as

C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \ \ \ \ (87)

where C_0 is the original, unregularized cost function.

Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of \lambda: when \lambda is small we prefer to minimize the original cost function, but when \lambda is large we prefer small weights.
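Differentiating equation (87) gives \partial C / \partial w = \partial C_0 / \partial w + \frac{\lambda}{n} w, so each gradient-descent step first shrinks ("decays") the weight and then takes the usual unregularized step. A minimal sketch, assuming a learning rate eta and a precomputed gradient grad_c0 of the unregularized cost:

def weight_update(w, grad_c0, eta, lmbda, n):
    # dC/dw = dC0/dw + (lmbda / n) * w, so the step rescales w by
    # (1 - eta * lmbda / n) before the usual gradient move: "weight decay".
    return (1 - eta * lmbda / n) * w - eta * grad_c0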

Now, it’s really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We’ll address the question of why it helps in the next section. But first, let’s work through an example showing that regularization really does reduce overfitting.

……

I’ve described regularization as a way to reduce overfitting and to increase classification accuracies. In fact, that’s not the only benefit. Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I’ve found that the unregularized runs will occasionally get “stuck”, apparently caught in local minima of the cost function. The result is that different runs sometimes provide quite different results. By contrast, the regularized runs have provided much more easily replicable results.

Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.

───

 

Perhaps he is one who prizes "experience"?? Since his text already spells everything out clearly and plainly, let us instead say a little about "terminology" and "conceptual origins"!! The Wikipedia entry

Regularization (mathematics)

Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.

Introduction

In general, a regularization term R(f) is introduced to a general loss function:

\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda R(f)

for a loss function V that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss, and for the term \lambda which controls the importance of the regularization term. R(f) is typically a penalty on the complexity of f, such as restrictions for smoothness or bounds on the vector space norm.[1]
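As one concrete reading of this template, a small Python sketch; the helper name, the choice of square loss for V, and the penalty R(f) = \|w\|^2 for a linear model f(x) = w \cdot x are illustrative assumptions, not from the article:

import numpy as np

def penalized_loss(w, X, Y, lmbda):
    # sum_i V(f(x_i), y_i) + lambda * R(f), with V the square loss
    # and R(f) = ||w||^2 for the linear model f(x) = w . x
    V = (X @ w - Y) ** 2
    R = np.sum(w ** 2)
    return np.sum(V) + lmbda * R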

A theoretical justification for regularization is that it attempts to impose Occam’s razor on the solution, as depicted in the figure. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.

The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization have become popular.
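The sparsity-inducing use mentioned above is easy to observe empirically. A small sketch using scikit-learn's Ridge (L_2 penalty) and Lasso (L_1 penalty) estimators on synthetic data of my own making:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)   # only two informative features
Y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, Y)   # L2: shrinks every coefficient a little
lasso = Lasso(alpha=0.1).fit(X, Y)   # L1: drives irrelevant coefficients to zero
print(np.round(ridge.coef_, 2))      # small but nonzero everywhere
print(np.round(lasso.coef_, 2))      # mostly exact zeros -- a sparse model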

……

[Figure: Regularization.svg] The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting \lambda, the weight of the regularization term.

……

Tikhonov regularization

When learning a linear function f(x) = w \cdot x, penalizing the loss with the squared L_2 norm of the weight vector w corresponds to Tikhonov regularization. This is one of the most common forms of regularization; it is also known as ridge regression, and is expressed as:

\min_w \sum_{i=1}^{n} V(\hat x_i \cdot w, \hat y_i) + \lambda \|w\|_{2}^{2}

In the case of a general function, we take the norm of the function in its reproducing kernel Hilbert space:

\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda \|f\|_{\mathcal{H}}^{2}

As the L_2 norm is differentiable, learning problems using Tikhonov regularization can be solved by gradient descent.
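A minimal numpy sketch of such a gradient-descent solution, for the least-squares case treated next; the step size and iteration count are arbitrary choices of mine:

import numpy as np

def ridge_gd(X, Y, lmbda, eta=0.1, steps=2000):
    # gradient descent on (1/n) * ||X w - Y||^2 + lmbda * ||w||^2
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2 / n) * X.T @ (X @ w - Y) + 2 * lmbda * w
        w -= eta * grad
    return w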

Tikhonov regularized least squares

The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal w will be the one for which the gradient of the loss function with respect to w is 0.

\min_w \frac{1}{n} \|\hat X w - \hat Y\|^2 + \lambda \|w\|_{2}^{2}
\nabla_w = \frac{2}{n} \hat X^T (\hat X w - \hat Y) + 2 \lambda w = 0 \quad \leftarrow \text{the first-order condition for this optimization problem}
0 = \hat X^T (\hat X w - \hat Y) + n \lambda w
w = (\hat X^T \hat X + \lambda n I)^{-1} (\hat X^T \hat Y)

By construction of the optimization problem, other values of w would give larger values for the loss function. This could be verified by examining the second derivative \nabla_{ww}.

During training, this algorithm takes O(d^3 + nd^2) time. The terms correspond to the matrix inversion and calculating X^T X, respectively. Testing takes O(nd) time.
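The closed form above is a one-liner in numpy; a sketch, using np.linalg.solve rather than forming the inverse explicitly, which is cheaper and numerically safer:

import numpy as np

def ridge_closed_form(X, Y, lmbda):
    # w = (X^T X + lambda * n * I)^{-1} X^T Y, as derived above
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lmbda * n * np.eye(d), X.T @ Y)

For well-conditioned problems this agrees, up to convergence tolerance, with the gradient-descent sketch ridge_gd given earlier.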

───

 

This sums up the "purpose" of the method and several "functional forms" that "regularization" can take. From it one can see its kinship with the "fitting an elephant" anecdote as well as with "Occam's razor"!!??

In the text 《踏雪尋梅!!》 we gave a brief introduction to the "calculus of variations":

So how does science regard "prediction"? For instance, in 1744 the great Swiss mathematician and physicist Leonhard Euler, in his paper Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes, sive solutio problematis isoperimetrici lattissimo sensu accepti ("A method for finding curves enjoying a maximum or minimum property, or the solution of the isoperimetric problem taken in the broadest sense"), stated the definition of the "principle of least action" with perfect clarity:

Suppose a particle of mass M and velocity v moves through an infinitesimal distance ds. The particle's momentum is M \cdot v; multiplying it by the infinitesimal distance ds gives M \cdot v \ ds, the particle's momentum acting over the infinitesimal "path" distance ds. I claim: among all possible "paths" connecting the initial and final end points, the particle's true "trajectory" is the "path" for which \int_{initial}^{final} M \cdot v \ ds is a minimum; if the mass is assumed constant, it is the "orbit" for which \int_{initial}^{final} v \ ds is a minimum.

That is, among all possible "paths" connecting the initial and final end points, the "path" the particle selects is an "extremum" of the "action" functional A = \int_{path} M \cdot v \ ds. This is the "calculus of variations" (Variation) statement of Newton's second law of motion. From today's energy point of view, A = \int_{path} M \cdot v \ ds = \int_{path} M \cdot v \ \frac {ds}{dt} dt = \int_{path} M \cdot v^2 dt = 2 \int_{path} T dt, where T = \frac{1}{2} M v^2 is the particle's kinetic energy. Since Newton's second law can be stated as F = M \cdot a = \frac {d P}{dt}, \ P = M \cdot v, it follows that \int_{path} \frac {d P}{dt} ds = \int_{path} \frac {d s}{dt} dP = \int_{path} v \, dP = \Delta T = \int_{path} F \, ds.

Suppose the force acting on the particle is a "conservative force", meaning the "work" it does along any path depends only on the "positions" of the initial and final end points, not on the "path" taken. In physics one then usually defines the "potential energy" of this "force field" as V = - \int_{ref}^{position} F \, ds, so that for a particle in a conservative field \Delta T + \Delta V = 0. This is the principle of "conservation of energy" in physics! Gravity, spring forces, and electric forces, for example, are all conservative, while friction and air resistance are typical non-conservative forces. Since \Delta V is the same across all these possible paths, the "path" determined by the "principle of least action" is likewise an "extremum" of the "action" A. In 1788 the Italian-born French mathematician and astronomer Joseph Lagrange, who contributed greatly to the development of the calculus of variations, used the "law of conservation of energy" in his treatise Mécanique analytique to derive the correctness of the principle of least action as stated by Euler.

───

 

It mentioned the achievements of "Lagrange", remembered today in what we call the "Lagrange multiplier":

Lagrange multiplier

In mathematical optimization, the method of Lagrange multipliers (named after Joseph Louis Lagrange[1]) is a strategy for finding the local maxima and minima of a function subject to equality constraints.

For instance (see Figure 1), consider the optimization problem

maximize f(x, y)
subject to g(x, y) = 0.

We need both f and g to have continuous first partial derivatives. We introduce a new variable (λ) called a Lagrange multiplier and study the Lagrange function (or Lagrangian) defined by

 \mathcal{L}(x,y,\lambda) = f(x,y) - \lambda \cdot g(x,y),

where the λ term may be either added or subtracted. If f(x_0, y_0) is a maximum of f(x, y) for the original constrained problem, then there exists \lambda_0 such that (x_0, y_0, \lambda_0) is a stationary point for the Lagrange function (stationary points are those points where the partial derivatives of \mathcal{L} are zero). However, not all stationary points yield a solution of the original problem. Thus, the method of Lagrange multipliers yields a necessary condition for optimality in constrained problems.[2][3][4][5][6] Sufficient conditions for a minimum or maximum also exist.
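A tiny worked instance may help: take the hypothetical objective f(x, y) = x + y with the constraint g(x, y) = x^2 + y^2 - 1 = 0 (the unit circle); neither comes from the article. A sympy sketch of the stationary-point computation:

import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y                  # hypothetical objective
g = x**2 + y**2 - 1        # constraint g(x, y) = 0: the unit circle
L = f - lam * g            # the Lagrangian, with the minus-sign convention above

stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)
# two stationary points: (x, y) = (sqrt(2)/2, sqrt(2)/2) with lambda = sqrt(2)/2
# (the constrained maximum) and its mirror image (the constrained minimum)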

[Figure 1: LagrangeMultipliers3D] Find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c.

[Figure 2: LagrangeMultipliers2D.svg] Contour map of Figure 1. The red line shows the constraint g(x, y) = c. The blue lines are contours of f(x, y). The point where the red line tangentially touches a blue contour is the solution. Since d_1 > d_2, the solution is a maximization of f(x, y).

……

How the method of Lagrange multipliers is used

Let f be a function defined on R^n, with constraints g_k(x) = c_k (or, moving the constants to the left-hand side, g_k(x) - c_k = 0). Define the Lagrangian \Lambda as

\Lambda(\mathbf x, \boldsymbol \lambda) = f + \sum_k \lambda_k(g_k-c_k)

Note that the extremum condition and the constraints are now both encoded in a single expression:

\nabla_{\mathbf x} \Lambda = 0 \Leftrightarrow \nabla f = - \sum_k \lambda_k \nabla g_k,

\nabla_{\boldsymbol \lambda} \Lambda = 0 \Leftrightarrow g_k = c_k

Lagrange multipliers are often used to express a maximal rate of growth. From the expression

-\frac{\partial \Lambda}{\partial {c_k}} = \lambda_k

we can see that \lambda_k is the rate at which the attainable optimum grows as the constraint level c_k is relaxed. Lagrangian mechanics makes use of this principle.

The method of Lagrange multipliers is generalized by the Karush-Kuhn-Tucker conditions.
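A numeric check of this sensitivity interpretation, continuing the hypothetical f = x + y example from the sketch above, using scipy; relaxing the constraint level c should grow the optimum of f at the rate \lambda = \sqrt{2}/2 \approx 0.707:

import numpy as np
from scipy.optimize import minimize

def max_f(c):
    # maximize f(x, y) = x + y subject to x^2 + y^2 = c, via SLSQP
    res = minimize(lambda p: -(p[0] + p[1]), x0=[1.0, 0.0], method='SLSQP',
                   constraints=[{'type': 'eq',
                                 'fun': lambda p: p[0]**2 + p[1]**2 - c}])
    return -res.fun

dc = 1e-3
print((max_f(1 + dc) - max_f(1)) / dc)   # ~0.707, the multiplier found above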

───

Example 3: Entropy

Suppose we wish to find the discrete probability distribution \{p_1, p_2, \cdots, p_n\} on the points \{x_1, x_2, \cdots, x_n\} with maximal information entropy. This is the same as saying that we wish to find the least biased probability distribution on those points. In other words, we wish to maximize the Shannon entropy:

f(p_1,p_2,\cdots,p_n) = -\sum_{j=1}^n p_j\log_2 p_j.

For this to be a probability distribution the sum of the probabilities  p_i at each point x_i must equal 1, so our constraint is:

g(p_1,p_2,\cdots,p_n)=\sum_{j=1}^n p_j = 1.

We use Lagrange multipliers to find the point of maximum entropy, \vec{p}^{\,*}, across all discrete probability distributions \vec{p} on \{x_1,x_2, \cdots, x_n\}. We require that:

\left.\frac{\partial}{\partial \vec{p}}(f+\lambda (g-1))\right|_{\vec{p}=\vec{p}^{\,*}}=0,

which gives a system of n equations, k = 1, \cdots, n, such that:

\left.\frac{\partial}{\partial p_k}\left\{-\left (\sum_{j=1}^n p_j \log_2 p_j \right ) + \lambda \left(\sum_{j=1}^n p_j - 1\right) \right\}\right|_{p_k=p^*_k} = 0.

Carrying out the differentiation of these n equations, we get

-\left(\frac{1}{\ln 2}+\log_2 p^*_k \right) + \lambda = 0.

This shows that all p^*_k are equal (because they depend on λ only). By using the constraint

\sum_j p_j =1,

we find

p^*_k = \frac{1}{n}.

Hence, the uniform distribution is the distribution with the greatest entropy, among distributions on n points.
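A quick numeric confirmation with scipy; n = 5, the optimizer, and the random starting distribution are my choices:

import numpy as np
from scipy.optimize import minimize

n = 5
entropy = lambda p: -np.sum(p * np.log2(p))    # Shannon entropy, in bits
res = minimize(lambda p: -entropy(p),          # maximize = minimize the negative
               x0=np.random.dirichlet(np.ones(n)),
               method='SLSQP',
               bounds=[(1e-9, 1.0)] * n,
               constraints=[{'type': 'eq', 'fun': lambda p: np.sum(p) - 1}])
print(res.x)   # every entry ~ 1/n = 0.2: the uniform distribution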

───

 

Truly, he was the first to open the "gateway" to this method??!!