W!o+'s 《小伶鼬工坊演義》: Neural Networks [Turning Point] Three

Some spur their horses past the rushing wind; others ply a slow oar and arrive with the turning of moon and stars. One wonders what manner of man Michael Nielsen is. Following on the problem of 'overfitting', he first devotes a long passage to the 'practice' of 'regularization':

Regularization

Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we’d only adopt reluctantly.

Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here’s the regularized cross-entropy:

C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln (1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2. \ \ \ \ (85)

The first term is just the usual expression for the cross-entropy. But we’ve added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor \lambda / 2n, where \lambda > 0 is known as the regularization parameter, and n is, as usual, the size of our training set. I’ll discuss later how \lambda is chosen. It’s also worth noting that the regularization term doesn’t include the biases. I’ll also come back to that below.

Of course, it’s possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:

C = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \ \ \ \ (86)

In both cases we can write the regularized cost function as

C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \ \ \ \ (87)

where C_0 is the original, unregularized cost function.

Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of \lambda: when \lambda is small we prefer to minimize the original cost function, but when \lambda is large we prefer small weights.

Now, it’s really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We’ll address the question of why it helps in the next section. But first, let’s work through an example showing that regularization really does reduce overfitting.
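To make the compromise concrete, here is a minimal numpy sketch (my own illustration, not code from Nielsen's network2.py; the function and variable names are mine) of the regularized cost (87) and the gradient step it induces:

import numpy as np

def regularized_cost(C0, weights, lmbda, n):
    # Equation (87): C = C0 + (lambda / 2n) * sum over all weights of w^2
    return C0 + (lmbda / (2.0 * n)) * sum(np.sum(w ** 2) for w in weights)

def weight_decay_step(w, grad_C0, eta, lmbda, n):
    # Gradient step on (87): the regularization term contributes (lambda/n) w to
    # the gradient, so each step first rescales w by (1 - eta * lambda / n) and
    # then applies the usual unregularized gradient -- hence the name "weight decay".
    return (1.0 - eta * lmbda / n) * w - eta * grad_C0

The rescaling factor makes the trade-off visible: larger \lambda shrinks the weights harder on every step, while the -\eta \, \partial C_0 / \partial w part still pulls toward fitting the data.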

……

I’ve described regularization as a way to reduce overfitting and to increase classification accuracies. In fact, that’s not the only benefit. Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I’ve found that the unregularized runs will occasionally get “stuck”, apparently caught in local minima of the cost function. The result is that different runs sometimes provide quite different results. By contrast, the regularized runs have provided much more easily replicable results.

Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.

───

 

Perhaps he is one who values 'experience'?? Since the text already states things plainly and clearly, let us instead say a little about 'terminology' and the 'origins of the concept'!! The Wikipedia entry

Regularization (mathematics)

Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.

Introduction

In general, a regularization term R(f) is introduced to a general loss function:

\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda R(f)

for a loss function V that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss, and for the term \lambda which controls the importance of the regularization term. R(f) is typically a penalty on the complexity of f, such as restrictions for smoothness or bounds on the vector space norm.[1]

A theoretical justification for regularization is that it attempts to impose Occam’s razor on the solution, as depicted in the figure. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.

The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization have become popular.

……


The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting \lambda, the weight of the regularization term.

……

Tikhonov regularization

When learning a linear function, such that f(x) = w \cdot x, the L_2 norm loss corresponds to Tikhonov regularization. This is one of the most common forms of regularization, is also known as ridge regression, and is expressed as:

\min_w \sum_{i=1}^{n} V(\hat x_i \cdot w, \hat y_i) + \lambda \|w\|_{2}^{2}

In the case of a general function, we take the norm of the function in its reproducing kernel Hilbert space:

\min_f \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i) + \lambda \|f\|_{\mathcal{H}}^{2}

As the L_2 norm is differentiable, learning problems using Tikhonov regularization can be solved by gradient descent.

Tikhonov regularized least squares

The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, the optimal w will be the one for which the gradient of the loss function with respect to w is 0.

\min_w \frac{1}{n} \|\hat X w - \hat Y\|^2 + \lambda \|w\|_{2}^{2}

Setting the gradient to zero gives the first-order condition for this optimization problem:

\nabla_w = \frac{2}{n} \hat X^T (\hat X w - \hat Y) + 2 \lambda w = 0
0 = \hat X^T (\hat X w - \hat Y) + n \lambda w
w = (\hat X^T \hat X + \lambda n I)^{-1} (\hat X^T \hat Y)

By construction of the optimization problem, other values of w would give larger values for the loss function. This could be verified by examining the second derivative \nabla_{ww}.

During training, this algorithm takes O(d^3 + nd^2) time. The terms correspond to the matrix inversion and calculating X^T X, respectively. Testing takes O(nd) time.
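The closed form above is a few lines of numpy (a sketch; the variable names and the synthetic data are mine):

import numpy as np

def ridge_fit(X, Y, lmbda):
    # w = (X^T X + lambda n I)^{-1} X^T Y; np.linalg.solve avoids forming the
    # explicit inverse, though the asymptotic O(d^3) cost is the same.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lmbda * n * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, Y, 0.01))   # approximately [1, -2, 0.5]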

───

 

sums up the 'purpose' of the method and several 'functional forms' of 'regularization'. From this one can see its kinship with the 'elephant story' and with 'Occam's razor'!!??

In the piece 《踏雪尋梅!!》 we gave a brief introduction to the 'calculus of variations':

How, then, does science regard 'prediction'? For instance, in 1744 the great Swiss mathematician and physicist Leonhard Euler, in his paper Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes, sive solutio problematis isoperimetrici lattissimo sensu accepti ('A method for finding curves enjoying a maximum or minimum property, or the solution of the isoperimetric problem taken in the broadest sense'), gave a perfectly clear definition of the 'principle of least action':

Let a particle of mass M and speed v travel an infinitesimal distance ds. Its momentum is M \cdot v, and multiplying by the infinitesimal distance ds gives M \cdot v \ ds, the particle's momentum acting over the infinitesimal 'path' element ds. I claim: among all possible 'paths' joining the two 'endpoints', the true 'trajectory' of the particle's motion is the 'path' for which \int_{initial}^{final} M \cdot v \ ds is a minimum; if the mass is assumed constant, it is the 'orbit' for which \int_{initial}^{final} v \ ds is a minimum.

That is, among all possible 'paths' joining the two endpoints, the 'path' the particle selects is an 'extremum' of the 'action' functional A = \int_{path} M \cdot v \ ds; this is the 'variational' description of Newton's second law. Viewed from today's energy standpoint, A = \int_{path} M \cdot v \ ds = \int_{path} M \cdot v \ \frac{ds}{dt} dt = \int_{path} M \cdot v^2 dt = 2 \int_{path} T dt, where T = \frac{1}{2} M v^2 is the particle's kinetic energy. Since Newton's second law can be stated as F = M \cdot a = \frac{dP}{dt}, \ P = M \cdot v, it follows that \int_{path} \frac{dP}{dt} ds = \int_{path} \frac{ds}{dt} dP = \int_{path} v \, dP = \Delta T = \int_{path} F \, ds.

If the force on the particle is a 'conservative force', that is, if the 'work' the force does along any path depends only on the 'positions' of the two endpoints and not on the 'path' taken, then in physics one usually defines the 'potential energy' of the 'force field' as V = - \int_{ref}^{position} F ds. For a particle in a conservative field, \Delta T + \Delta V = 0, which is the physical principle of 'conservation of energy'! Gravity, spring forces, and electric forces, for example, are all conservative, while friction and air resistance are typical non-conservative forces. Since \Delta V is the same over all the candidate paths, the 'path' singled out by the 'principle of least action' is likewise an 'extremum' of the 'action' A. In 1788 the Italian-born French mathematician and astronomer Joseph Lagrange, who contributed greatly to the development of the calculus of variations, first used the 'law of conservation of energy' in his treatise Mécanique Analytique to derive the correctness of the principle of least action as stated by Euler.

───

 

It mentioned the achievements of 'Lagrange'; among them what we now call the 'Lagrange multiplier':

Lagrange multiplier

In mathematical optimization, the method of Lagrange multipliers (named after Joseph Louis Lagrange[1]) is a strategy for finding the local maxima and minima of a function subject to equality constraints.

For instance (see Figure 1), consider the optimization problem

maximize f(x, y)
subject to g(x, y) = 0.

We need both f and g to have continuous first partial derivatives. We introduce a new variable (λ) called a Lagrange multiplier and study the Lagrange function (or Lagrangian) defined by

 \mathcal{L}(x,y,\lambda) = f(x,y) - \lambda \cdot g(x,y),

where the λ term may be either added or subtracted. If f(x0, y0) is a maximum of f(x, y) for the original constrained problem, then there exists λ0 such that (x0, y0, λ0) is a stationary point for the Lagrange function (stationary points are those points where the partial derivatives of \mathcal{L} are zero). However, not all stationary points yield a solution of the original problem. Thus, the method of Lagrange multipliers yields a necessary condition for optimality in constrained problems.[2][3][4][5][6] Sufficient conditions for a minimum or maximum also exist.


Figure 1: Find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c.


Figure 2: Contour map of Figure 1. The red line shows the constraint g(x, y) = c. The blue lines are contours of f(x, y). The point where the red line tangentially touches a blue contour is the solution. Since d1 > d2, the solution is a maximization of f(x, y).

……

How the method of Lagrange multipliers is applied

Let f be a function defined on R^n, with constraints g_k(x) = c_k (or, moving the constants to the left, g_k(x) − c_k = 0). Define the Lagrangian Λ as

\Lambda(\mathbf x, \boldsymbol \lambda) = f + \sum_k \lambda_k(g_k-c_k)

Note that the extremum condition and the constraints are now both encoded in a single expression:

\nabla_{\mathbf x} \Lambda = 0 \Leftrightarrow \nabla f = - \sum_k \lambda_k \nabla g_k,

\nabla_{\mathbf \lambda} \Lambda = 0 \Leftrightarrow g_k = c_k

Lagrange multipliers are often used to express a maximal rate of growth. The reason can be seen from the expression:

-\frac{\partial \Lambda}{\partial {c_k}} = \lambda_k

from which we can read off that λ_k is the maximal rate of growth attainable by the objective as the constraint c_k is relaxed. Lagrangian mechanics makes use of this principle.

The method of Lagrange multipliers is generalized by the Karush-Kuhn-Tucker conditions.
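The recipe can be checked symbolically in a few lines (my own toy example, not from the entry: maximize f(x, y) = x + y on the unit circle g = x^2 + y^2 - 1 = 0):

import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                               # objective
g = x**2 + y**2 - 1                     # constraint g = 0 (the unit circle)
Lag = f + lam * g                       # Lagrangian; the sign convention on lam is immaterial
stationary = sp.solve([sp.diff(Lag, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)   # x = y = ±sqrt(2)/2; the positive root is the constrained maximum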

───

Example 3: Entropy

Suppose we wish to find the discrete probability distribution on the points \{p_1, p_2, \cdots, p_n\} with maximal information entropy. This is the same as saying that we wish to find the least biased probability distribution on the points \{p_1, p_2, \cdots, p_n\}. In other words, we wish to maximize the Shannon entropy equation:

f(p_1,p_2,\cdots,p_n) = -\sum_{j=1}^n p_j\log_2 p_j.

For this to be a probability distribution the sum of the probabilities  p_i at each point x_i must equal 1, so our constraint is:

g(p_1,p_2,\cdots,p_n)=\sum_{j=1}^n p_j = 1.

We use Lagrange multipliers to find the point of maximum entropy, \vec{p}^{\,*}, across all discrete probability distributions \vec{p} on \{x_1,x_2, \cdots, x_n\}. We require that:

\left.\frac{\partial}{\partial \vec{p}}(f+\lambda (g-1))\right|_{\vec{p}=\vec{p}^{\,*}}=0,

which gives a system of n equations, k = 1, \ldots, n, such that:

\left.\frac{\partial}{\partial p_k}\left\{-\left (\sum_{j=1}^n p_j \log_2 p_j \right ) + \lambda \left(\sum_{j=1}^n p_j - 1\right) \right\}\right|_{p_k=p^*_k} = 0.

Carrying out the differentiation of these n equations, we get

-\left(\frac{1}{\ln 2}+\log_2 p^*_k \right) + \lambda = 0.

This shows that all p^*_k are equal (because they depend on λ only). By using the constraint

\sum_j p_j =1,

we find

p^*_k = \frac{1}{n}.

Hence, the uniform distribution is the distribution with the greatest entropy, among distributions on n points.
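The same conclusion can be verified numerically (a sketch; n = 4 and the starting distribution are arbitrary choices of mine):

import numpy as np
from scipy.optimize import minimize

n = 4
def neg_entropy(p):
    return np.sum(p * np.log2(p))       # minimizing -H(p) maximizes the entropy

constraint = {'type': 'eq', 'fun': lambda p: np.sum(p) - 1}
p0 = np.array([0.1, 0.2, 0.3, 0.4])     # any interior starting distribution
res = minimize(neg_entropy, p0, bounds=[(1e-9, 1)] * n, constraints=[constraint])
print(res.x)                            # ≈ [0.25, 0.25, 0.25, 0.25], i.e. uniform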

───

 

Here indeed is what first opened the 'gateway' to this method??!!


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Turning Point] Two

After the 'elephant story', it seems only natural that Michael Nielsen should go on to discuss the phenomenon of 'overfitting':

Let’s sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We’ll use our 30 hidden neuron network, with its 23,860 parameters. But we won’t train the network using all 50,000 MNIST training images. Instead, we’ll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We’ll train in a similar way to before, using the cross-entropy cost function, with a learning rate of \eta = 0.5 and a mini-batch size of 10. However, we’ll train for 400 epochs, a somewhat larger number than before, because we’re not using as many training examples. Let’s use network2 to look at the way the cost function changes:

>>> import mnist_loader 
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2 
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost) 
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)

Using the results we can plot the way the cost changes as the network learns*:

*This and the next four graphs were generated by the program overfitting.py.

This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I’ve only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we’ll see, turns out to be where the interesting action is.

Let’s now look at how the classification accuracy on the test data changes over time:

Again, I’ve zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting “better”. But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it’s not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
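The stopping rule Nielsen is gesturing at can be sketched in a few lines (my own helper, not part of network2; it takes the per-epoch evaluation-accuracy list that net.SGD returns when monitor_evaluation_accuracy=True, and the patience value is arbitrary):

def overfitting_epoch(eval_accuracies, patience=10):
    # Return the epoch after which evaluation accuracy stopped improving for
    # `patience` consecutive epochs -- a simple "no-improvement-in-n" rule.
    best, best_epoch = float('-inf'), 0
    for epoch, acc in enumerate(eval_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

Applied to the run above, such a rule would flag an epoch near 280, where the test accuracy stops improving even though the training cost keeps falling.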

………

 

Even though the whole passage is simple and easy reading, and its use of 'simulation' and 'graphs' to explain leaves little room for doubt, the value of reading often lies in 'finding doubt where none seems to be'!! Hence a 'question': how did the first person ever to eat a crab manage to take a bite??

,真不知那吃 蟹 蟹的第一人,如何想來,又怎麼下得了口 !?噗哧一笑,陡地,

貓  貓的理論上心頭︰

People usually say not to be 'superstitious'; but can 'science' itself also 'become' a kind of superstition? Or rather, what exactly is this thing called science? Science has not always been 'as it is today'; the science of today carries 'a certain spirit':

1. Ground claims in facts.
2. On the basis of facts, build hypotheses for explanation, and from them form theories.
3. Anyone, at any time and in any place, may experiment, attempting to confirm or refute what 1 and 2 assert.
4. Stay skeptical, imagine new phenomena, invent new tools; and keep trying, again and again, to overturn what 1, 2 and 3 assert.

Can such a spirit really be divided into 'Chinese' and 'Western'? Or into 'ancient' and 'modern'?

The great scientist Newton kept a cat. Thinking the cat 'unhappy' because it could not come and go through the door, he cut a hole in the door, and sure enough the cat grew happy. Years later the cat had kittens, and Newton, delighted, cut a smaller hole beside the first, so that the kittens too would surely be 'happy'. Who knows how Newton arrived at this 'theory of cats':

Big cats go through the big hole; little cats go through the little hole.

Could the 'little cats' really not go through the 'big hole'? Or would the 'little cats' be 'unhappy' otherwise??

So why not follow Newton: punch 'big' holes and 'small' holes 'happily'; each 'hole' opened is one more 'hole', and sooner or later one is bound to 'see through'!!☿☺

─── Excerpted from 《M♪o 之學習筆記本《辰》組元︰【䷠】黃牛之革》

 

And how did the first person to discuss this 'phenomenon' come to 'discover' it?? Was it the 'elephant story' that brought Occam's 'razor' to mind:

Occam's razor (English: Occam's Razor, Ockham's Razor; Latin lex parsimoniae, the law of parsimony) is a problem-solving principle put forward by the 14th-century logician and Franciscan friar William of Occam (c. 1287-1347; Ockham lies in Surrey, England). In his commentary on the Sentences, Book 2, Question 15, he wrote: never waste more to do what can be done equally well with less. Put another way, if many theories about the same question all make equally accurate predictions, one should pick the one that relies on the fewest assumptions. Although more complex methods can often make better predictions, setting predictive power aside, the fewer the premises the better.

Solomonoff's theory of inductive inference is a mathematical formalization of Occam's razor:[1][2][3][4][5][6] among all computable theories that perfectly describe the observations so far, the shorter computable theories receive greater weight when estimating the probability of the next observation.

In the natural sciences Occam's razor is used as a heuristic, more as a tool that helps scientists develop theoretical models than as an arbiter between published theories.[7][8] In the scientific method, Occam's razor is not regarded as a logically irrefutable theorem or a scientific conclusion. The preference for simplicity in the scientific method rests on the criterion of falsifiability. For every accepted explanation of a phenomenon there exist countless possible, more complex variants: any error in an explanation can be attributed to ad hoc hypotheses, thereby evading the error. Simpler theories are therefore preferable to complex ones, because they are more testable.[9][10][11]


The Copernican system as drawn by Andreas Cellarius in the Harmonia Macrocosmica (1708). The motions of the sun, moon and the other solar-system planets can be explained equally well geocentrically or heliocentrically, but the heliocentric account needs only 7 basic assumptions while the geocentric one needs many more; Nicolaus Copernicus pointed this out in the preface to De revolutionibus orbium coelestium.

……

Mathematics

One form of Occam's razor is a direct consequence of elementary probability theory. By definition, every assumption brings with it an added chance of error; if an assumption does not improve a theory's accuracy, its only effect is to increase the probability that the theory as a whole is wrong.

There have been other attempts to derive Occam's razor from probability theory, including notable ones by Harold Jeffreys and Edwin Thompson Jaynes. The (Bayesian) probabilistic basis of Occam's razor is given by David MacKay in chapter 28 of his book Information Theory, Inference, and Learning Algorithms,[30] where he stresses that no prior bias in favour of simpler models need be assumed.

William Jefferys (no relation to Harold Jeffreys) and James Berger (1991) generalized and quantified the 'assumptions' notion of the original razor as the degree to which a proposition is unnecessarily accommodating to possible observable data.[31] They argue: 'A hypothesis with fewer adjustable parameters will automatically have an enhanced posterior probability, because the predictions it makes are sharper.'[31] The model they propose balances a theory's predictive accuracy against its precision: theories that sharply make correct predictions are preferred over theories that allow a wide range of guesses or that are wrong. This again reflects the connection between the core concepts of Bayesian inference (marginal, conditional, and posterior probability).


Unnecessarily complex explanations are always possible. For example, a leprechaun could be tacked onto any explanation, but Occam's razor forbids such additions unless they are necessary.

───

 

And so it leads straight to the great gate of 'overfitting'!!

統計學中,過適英語:overfitting,或稱過度擬合現象是指在調適一個統計模型時,使用過多參數。對比於可取得的資料總量來說,一個荒謬的模型只要足夠複雜,是可以完美地適應資料。過適一般可以識為違反奧卡姆剃刀原 則。當可選擇的參數的自由度超過資料所包含資訊內容時,這會導致最後(調適後)模型使用任意的參數,這會減少或破壞模型一般化的能力更甚於適應資料。過適 的可能性不只取決於參數個數和資料,也跟模型架構與資料的一致性有關。此外對比於資料中預期的雜訊或錯誤數量,跟模型錯誤的數量也有關。

The notion of overfitting matters in machine learning as well. A learning algorithm is usually trained on training examples, that is, examples whose desired outputs are known; the learner is then expected to predict the correct outputs for other examples, and so should handle the general situation rather than merely the data used in training (according to its inductive bias). Yet the learner may adapt to random features of the training data that are too specific, especially when training runs too long or the examples are too few. In the process of overfitting, performance on the training examples still improves while performance on unseen data becomes worse.

In both statistics and machine learning, extra techniques (such as cross-validation, early stopping, the Bayesian information criterion, the Akaike information criterion, or model comparison) are needed to avoid overfitting, by indicating when further training ceases to yield better generalization. In artificial neural networks the process of overfitting is also known as overtraining (English: overtraining). In treatment learning, a minimum best support value (English: minimum best support value) is used to avoid overfitting.

In contrast to overfitting, which uses too many parameters and so fits the data rather than the general situation, another common phenomenon is to use too few parameters and fail to fit the data; this is called underfitting (English: underfitting).

───


Noisy (roughly linear) data is fitted to both linear and polynomial functions. Although the polynomial function is a perfect fit, the linear version can be expected to generalize better. In other words, if the two functions were used to extrapolate the data beyond the fit data, the linear function would make better predictions.
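The caption's linear-versus-polynomial contrast is easy to reproduce (a sketch with synthetic data of my own; the degree 9 is an arbitrary "too complex" choice):

import numpy as np

rng = np.random.default_rng(1)
x_fit = np.linspace(0.0, 1.0, 10)
y_fit = 2.0 * x_fit + 0.1 * rng.normal(size=10)   # noisy, roughly linear data
x_new = np.linspace(1.0, 1.5, 10)                  # points beyond the fit data
y_new = 2.0 * x_new

for degree in (1, 9):
    coeffs = np.polyfit(x_fit, y_fit, degree)
    fit_err = np.mean((np.polyval(coeffs, x_fit) - y_fit) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(degree, fit_err, new_err)   # degree 9: tiny fit error, huge error beyond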

……

Machine learning

Usually a learning algorithm is trained using some set of “training data”: exemplary situations for which the desired output is known. The goal is that the algorithm will also perform well on predicting the output when fed “validation data” that was not encountered during its training.

Overfitting is the use of models or procedures that violate Occam’s razor, for example by including more adjustable parameters than are ultimately optimal, or by using a more complicated approach than is ultimately optimal. For an example where there are too many adjustable parameters, consider a dataset where training data for y can be adequately predicted by a linear function of two dependent variables. Such a function requires only three parameters (the intercept and two slopes). Replacing this simple function with a new, more complex quadratic function, or with a new, more complex linear function on more than two dependent variables, carries a risk: Occam’s razor implies that any given complex function is a priori less probable than any given simple function. If the new, more complicated function is selected instead of the simple function, and if there was not a large enough gain in training-data fit to offset the complexity increase, then the new complex function “overfits” the data, and the complex overfitted function will likely perform worse than the simpler function on validation data outside the training dataset, even though the complex function performed as well, or perhaps even better, on the training dataset.[2]

When comparing different types of models, complexity cannot be measured solely by counting how many parameters exist in each model; the expressivity of each parameter must be considered as well. For example, it is nontrivial to directly compare the complexity of a neural net (which can track curvilinear relationships) with m parameters to a regression model with n parameters.[2]

Overfitting is especially likely in cases where learning was performed too long or where training examples are rare, causing the learner to adjust to very specific random features of the training data, that have no causal relation to the target function. In this process of overfitting, the performance on the training examples still increases while the performance on unseen data becomes worse.

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.

Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from the fact that information from all past experience can be divided into two groups: information that is relevant for the future and irrelevant information (“noise”). Everything else being equal, the more difficult a criterion is to predict (i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore. A learning algorithm that can reduce the chance of fitting noise is called robust.


Overfitting/overtraining in supervised learning (e.g., neural network). Training error is shown in blue, validation error in red, both as a function of the number of training cycles. If the validation error increases (positive slope) while the training error steadily decreases (negative slope) then a situation of overfitting may have occurred. The best predictive and fitted model would be where the validation error has its global minimum.

───

 

One must always remember that 'generalizing from the part to the whole' is a fallacy, and that 'inferring the part from the whole' easily commits the 'ecological fallacy':

The ecological fallacy (Ecological fallacy), also called the ecological inference fallacy or cross-level fallacy, is a common error in the analysis of statistical data. Opposite to hasty generalization, the ecological fallacy infers from the whole to the part: drawing conclusions about individuals solely from statistics about the group they belong to commits the ecological fallacy. The fallacy assumes that every individual in the group possesses the properties of the group (hence stereotyping (Stereotypes) may also commit the ecological fallacy). The converse of the ecological fallacy is reductionism (Reductionism).

Origin

The term 'ecological fallacy' first appeared in a 1950 article by William S. Robinson.[1] Using the results of the 1930 U.S. census, Robinson analysed the relation between literacy rate and the proportion of new immigrants across the 48 states. He found a correlation coefficient of 0.53 between the two: on average, the higher a state's proportion of new immigrants, the higher its literacy rate. But when the individual-level data were analysed, the correlation coefficient turned out to be -0.11: on average, new immigrants were less literate than the native-born. This apparently paradoxical result arises because new immigrants tended to settle in states whose literacy rates were already high. Robinson therefore warned that, when handling group-level or ecological data, one must attend to whether the data actually apply to individuals.

This is not to say that every inference from group data to individual properties is wrong, but in drawing such inferences one must note whether the group data hide the variation within the group.

───

 

'Stereotypes' lead to just such oversimplification; why, then, does advertising love them so? Whether 'Monroe's skirt' or the 'bent-knee kiss', they are nothing but well-worn stereotypes!


An eighteenth-century Dutch picture depicting the peoples of Asia, America and Africa as savages; shown below are an Englishman, a Dutchman, a German and a Frenchman.

 

Can the clear-sighted afford not to be careful???


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Turning Point] One

The proverb says: 'Heaven will rain; niang will marry' (天要下雨,娘要嫁人). What does it mean? Mr. 徐尚禮 once explained the matter on the China Times website (中時電子報). A quick 'Baidu' search suggests that the 'niang' 娘 in question may truly be a 'pure and lovely' 'maiden'! Examining the character's origins:

Shuowen Jiezi

孃: to vex and disturb; another sense, stout and large. From 女 (woman), 襄 phonetic; read nǚ-liáng qiè (niang).

Shuowen Jiezi Zhu (Duan Yucai's annotations)

(孃) To vex and disturb: 煩 means a feverish headache, and the paired character means to vex. Where people today write 擾攘, the ancients wrote 孃; the biography of Jia Yi has 搶攘, the 'Zai You' chapter of Zhuangzi has 傖囊, the Songs of Chu has 恇攘, and the vulgar form is 劻勷, all loan characters; today 攘 is current and 孃 has fallen out of use. Note further that the Guangyun reads 孃 (nǚ-liáng qiè) as the word for 'mother', and 娘 (also nǚ-liáng qiè) as a designation for a young woman; Tang writers kept the two characters strictly apart, so the word in 耶孃 ('father and mother') was never written 娘, though few people now know this. As for the other sense, 'stout and large': the Fangyan glosses it as 'abundant', a usage of Qin and Jin, said admiringly of the plump fullness of those one loves, which Guo's commentary explains as 'fat, with much flesh'. In the Han shu, the 壤 of the 'favoured sons' of Liang and Dai is this same character. From 女, 襄 phonetic, nǚ-liáng qiè, tenth rhyme group; in both of the foregoing senses it should be read like 壤. (Several rare glyphs in this entry were rendered as images in the source and are paraphrased here.)

 

So 'niang' truly has a 'maiden's heart'?? Then just as 'heaven will rain' is bound to happen, a maiden in the spring of her longing, 'niang will marry', cannot be held back!! Hence, under given conditions of 'time and place', the 'condition' of 'niang will marry' may equal the 'cause' of 'heaven will rain'; how then could the 'chances' of the two differ??!! And in the same way one may understand why 'Socrates' 'had no choice but to die'!!??

The Death of Socrates

蘇格拉底之死法語:La Mort de Socrate)是法國畫家新古典主義畫派的奠基人雅克-路易·大衛於1787年創作的一幅油畫。和大衛同一時期的其他畫作一樣,《蘇格拉底之死》也採用了古典的主題:柏拉圖在《斐多篇》中所記錄的蘇格拉底之死。畫中鎮定自若 、一如既往討論哲學的蘇格拉底使人崇敬,而他周圍哀慟不已的朋友們增添了畫面的悲劇性,使畫面獲得了凝重、剛毅、冷峻的藝術效果[1]

Background and creation

In 399 BC Socrates, at the age of seventy, was convicted of impiety and of corrupting the young of Athens, and sentenced to die by poison. According to the Phaedo, Socrates faced death with great calm, carrying on philosophical discussion as usual with Crito, Phaedo, and the Thebans Simmias and Cebes, only now on the themes of what death is and what follows it. Socrates held the soul to be immortal, and regarded death as another realm, a place unlike this world, rather than the end of existence.[2]

In 1758 Diderot published his essay On Dramatic Poetry, judging the death of Socrates a fit subject for a mime play, and paintings of the theme appeared repeatedly thereafter.[3] During his first trip to Rome in 1775-1780, David studied depictions of funerary rites and made many sketches, from which many of his major works derive.[4] David had already drafted this composition by 1782; in 1786 he accepted a commission from the younger son of Trudaine de Montigny and began the painting.[5][6] Completed in 1787 and shown at the Salon, it was warmly received by artists. Thomas Jefferson, then in Paris, likewise thought the painting fully embodied the sublime ideals of Neoclassicism, writing in a letter to the American history painter John Trumbull that the best work in the exhibition was David's Death of Socrates, 'a superb one'.[7]

 


Artist: Jacques-Louis David
Year: completed 1787
Type: oil painting
Dimensions: 129.5 cm × 196.2 cm
Location: Metropolitan Museum of Art, New York

 

So one can see why Michael Nielsen should suddenly bring up 'Fermi' and the 'elephant story':

Overfitting and regularization

The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. “Four” was the answer. Fermi replied*

*The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.

: “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn’t make it a good model. It may just mean there’s enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn’t been exposed to before.

Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That’s a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?

───

 

He even took care to supply a link:

How to fit an elephant

John von Neumann famously said

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

By this he meant that one should not be impressed when a complex model fits a data set well. With enough parameters, you can fit any data set.

It turns out you can literally fit an elephant with four parameters if you allow the parameters to be complex numbers.

I mentioned von Neumann’s quote on StatFact last week and Piotr Zolnierczuk replied with reference to a paper explaining how to fit an elephant:

“Drawing an elephant with four complex parameters” by Jurgen Mayer, Khaled Khairy, and Jonathon Howard,  Am. J. Phys. 78, 648 (2010), DOI:10.1119/1.3254017.

Piotr also sent me the following Python code he’d written to implement the method in the paper. This code produced the image above.
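The Python listing itself is not reproduced in this excerpt. The version below is a reconstruction along the lines of Zolnierczuk's widely circulated snippet, implementing the Mayer-Khairy-Howard parametrization with the parameter values quoted in the paper (treat it as a sketch, not the verbatim original):

import numpy as np
import matplotlib.pyplot as plt

# the four complex elephant parameters, plus p5 for the eye / trunk "wiggle" point
p1, p2, p3, p4 = 50 - 30j, 18 + 8j, 12 - 10j, -14 - 60j
p5 = 40 + 20j

def fourier(t, C):
    # Evaluate a truncated Fourier series whose complex coefficients C pack the
    # cosine amplitudes in the real parts and the sine amplitudes in the imaginary parts.
    f = np.zeros(t.shape)
    for k in range(len(C)):
        f += C[k].real * np.cos(k * t) + C[k].imag * np.sin(k * t)
    return f

def elephant(t):
    Cx = np.zeros(6, dtype=complex)
    Cy = np.zeros(6, dtype=complex)
    Cx[1], Cx[2], Cx[3], Cx[5] = p1.real * 1j, p2.real * 1j, p3.real, p4.real
    Cy[1], Cy[2], Cy[3] = p4.imag + p1.imag * 1j, p2.imag * 1j, p3.imag * 1j
    return fourier(t, Cx), fourier(t, Cy)

t = np.linspace(0, 2 * np.pi, 1000)
x, y = elephant(t)
plt.plot(y, -x)                      # rotate so the elephant stands upright
plt.plot(-p5.imag, p5.imag, 'o')     # the eye
plt.axis('equal')
plt.show()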

───

 

Surely it was not a subconscious wish to 'save the elephants'?? Yet who knew it would stir wistful memories of 'Lin Wang the elephant'!!


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Softmax] Three

The Fair Lady (《佳人》), by Du Fu

A peerless beauty there is, dwelling hidden in an empty valley.
She says she is a daughter of good family, fallen now among grass and trees.
When the passes lately fell in ruin, her brothers met with slaughter;
what avails high office? She could not even gather their bones.
The world's way is to scorn decline; all things turn with the guttering candle.
Her husband is a fickle youth; his new bride is lovely as jade.
Even the silk-tree knows the dusk and folds its leaves; mandarin ducks never roost alone.
He sees only his new love's smile; how should he hear his old love weep?
In the hills the spring runs clear; leaving the hills it runs muddy.
Her maid comes back from selling pearls; they pull creepers to mend the thatched hut.
She plucks flowers, not to dress her hair; she gathers cypress, filling her hands.
In the cold her kingfisher sleeves are thin; at dusk she leans on tall bamboo.

 

How is it that a peerless beauty must 'pull creepers to mend her thatched hut'? This single character 補 ('mend'):

The Shuowen Jiezi: 補 means to make a garment whole; from 衣 (clothing), 甫 phonetic.

speaks all the hardship of a troubled age and the coldness of human feeling! A torn garment must be 'mended' to be whole as before; a leaking roof must be 'mended' to keep out wind and rain; why must 'learning' too be 'mended'?? Could it be that what once was

Kun ䷁

Six in the second place: straight, square, great; without practice, nothing fails to profit.

must now be 'mended' with

The superior man has somewhere to go: going ahead, he strays and loses the way; following behind, he compliantly regains the constant.

Since Michael Nielsen wrote his 'Softmax' interlude with an unexpected stroke of the pen, clearly it is no mere walk-on part. For the sake of 'completeness', then, this author can only 'mend' the account as occasion demands with Stanford's

Softmax Regression


Introduction

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y^{(i)} \in \{0,1\}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y^{(i)} \in \{1,\ldots,K\} where K is the number of classes.

Recall that in logistic regression, we had a training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} of m labeled examples, where the input features are x^{(i)} \in \Re^{n}. With logistic regression, we were in the binary classification setting, so the labels were y^{(i)} \in \{0,1\}. Our hypothesis took the form:

h_\theta(x) = \frac{1}{1+\exp(-\theta^\top x)},

and the model parameters \theta were trained to minimize the cost function

J(\theta) = -\left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on K different values, rather than only two. Thus, in our training set \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}, we now have that y^{(i)} \in \{1, 2, \ldots, K\}. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have K=10 different classes.

Given a test input x, we want our hypothesis to estimate the probability that P(y=k | x) for each value of k = 1, \ldots, K. I.e., we want to estimate the probability of the class label taking on each of the K different possible values. Thus, our hypothesis will output a K-dimensional vector (whose elements sum to 1) giving us our K estimated probabilities. Concretely, our hypothesis h_{\theta}(x) takes the form:

h_\theta(x) = \begin{bmatrix} P(y = 1 | x; \theta) \\ P(y = 2 | x; \theta) \\ \vdots \\ P(y = K | x; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) }} \begin{bmatrix} \exp(\theta^{(1)\top} x ) \\ \exp(\theta^{(2)\top} x ) \\ \vdots \\ \exp(\theta^{(K)\top} x ) \\ \end{bmatrix}

Here \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)} \in \Re^{n} are the parameters of our model. Notice that the term \frac{1}{ \sum_{j=1}^{K}{\exp(\theta^{(j)\top} x) } } normalizes the distribution, so that it sums to one.

For convenience, we will also write \theta to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent \theta as an n-by-K matrix obtained by concatenating \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)} into columns, so that

\theta = \left[\begin{array}{cccc}| & | & | & | \\ \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\ | & | & | & | \end{array}\right].
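A numerically stable sketch of the hypothesis h_\theta(x) (the function and variable names are mine; Theta is the n-by-K matrix just described):

import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta: n-by-K matrix whose columns are theta^(1), ..., theta^(K); x: length-n input.
    z = Theta.T @ x        # the K scores theta^(k)T x
    z = z - np.max(z)      # shifting every score multiplies numerator and denominator
                           # by the same constant, so the probabilities are unchanged
                           # while exp can no longer overflow
    e = np.exp(z)
    return e / e.sum()     # K probabilities summing to 1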

Cost Function

We now describe the cost function that we’ll use for softmax regression. In the equation below, 1\{\cdot\} is the ”‘indicator function,”’ so that 1\{\hbox{a true statement}\}=1, and 1\{\hbox{a false statement}\}=0. For example, 1\{2+2=4\} evaluates to 1; whereas 1\{1+1=5\} evaluates to 0. Our cost function will be:

J(\theta) = - \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)})}\right]

Notice that this generalizes the logistic regression cost function, which could also have been written:

J(\theta) = - \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] = - \left[ \sum_{i=1}^{m} \sum_{k=0}^{1} 1\left\{y^{(i)} = k\right\} \log P(y^{(i)} = k | x^{(i)} ; \theta) \right]

The softmax cost function is similar, except that we now sum over the K different possible values of the class label. Note also that in softmax regression, we have that

P(y^{(i)} = k | x^{(i)} ; \theta) = \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^K \exp(\theta^{(j)\top} x^{(i)}) }.

We cannot solve for the minimum of J(\theta) analytically, and thus as usual we’ll resort to an iterative optimization algorithm. Taking derivatives, one can show that the gradient is:

\nabla_{\theta^{(k)}} J(\theta) = - \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = k\} - P(y^{(i)} = k | x^{(i)}; \theta) \right) \right] }

Recall the meaning of the ”\nabla_{\theta^{(k)}}” notation. In particular, \nabla_{\theta^{(k)}} J(\theta) is itself a vector, so that its j-th element is \frac{\partial J(\theta)}{\partial \theta^{(k)}_j}, the partial derivative of J(\theta) with respect to the j-th element of \theta^{(k)}.

Armed with this formula for the derivative, one can then plug it into a standard optimization package and have it minimize J(\theta).
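One step of batch gradient descent on this objective might look as follows (a sketch under conventions of my own: X is m-by-n, y holds labels in {1, ..., K}, Theta is n-by-K):

import numpy as np

def softmax_gradient_step(Theta, X, y, alpha):
    # X: m-by-n inputs; y: length-m integer labels in {1, ..., K}; Theta: n-by-K.
    m, K = X.shape[0], Theta.shape[1]
    Z = X @ Theta
    Z -= Z.max(axis=1, keepdims=True)     # numerical stabilization
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)     # P[i, k] = P(y_i = k | x_i; Theta)
    Y = np.eye(K)[y - 1]                  # one-hot rows encode the indicator 1{y_i = k}
    grad = -X.T @ (Y - P)                 # stacks grad_{theta^(k)} J for every k at once
    return Theta - alpha * grad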

………

 

a 'mend' offered in the hope that readers may see a thousand li farther, and that users may find in it stones from other hills with which to polish their own jade.


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Softmax] Two

Records of the Grand Historian, Scroll 85: the Biography of Lü Buwei

Lü Buwei was a great merchant of Yangdi. Travelling to and fro, buying cheap and selling dear, he amassed a fortune of a thousand gold.

……

Biography of Lü Buwei: At that time Wei had the Lord of Xinling, Chu the Lord of Chunshen, Zhao the Lord of Pingyuan, and Qi the Lord of Mengchang, all of whom humbled themselves before men of worth and delighted in guests, vying to outdo one another. Lü Buwei, ashamed that Qin for all its strength should fall short of them, likewise recruited scholars and treated them generously, until his retainers numbered three thousand. In those days the feudal states had many disputatious scholars, such as Xun Qing and his kind, whose writings spread through the world. So Lü Buwei had each of his retainers write down what he had learned, and compiled their discourses into eight Surveys, six Discourses and twelve Almanacs, more than two hundred thousand words. Holding that it embraced all the affairs of heaven and earth, the myriad things, antiquity and the present, he named it the Lüshi Chunqiu. He displayed it at the market gate of Xianyang, hung a thousand gold above it, and invited the travelling scholars and guests of the feudal lords: whoever could add or subtract a single character would receive the thousand gold.

───

 

The idiom says 'one character, a thousand gold'; Michael Nielsen indeed treasures his words like gold, and so this passage is easy to read but hard to see through!! One has no choice but to add annotations to string it together??

The learning slowdown problem: We’ve now built up considerable familiarity with softmax layers of neurons. But we haven’t yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let’s define the log-likelihood cost function. We’ll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

C \equiv -\ln a^L_y. \ \ \ \ (80)

So, for instance, if we’re training with MNIST images, and input an image of a 7, then the log-likelihood cost is -\ln a^L_7. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability a^L_7 which is close to 1, and so the cost -\ln a^L_7 will be small. By contrast, when the network isn’t doing such a good job, the probability a^L_7 will be smaller, and the cost -\ln a^L_7 will be larger. So the log-likelihood cost behaves as we’d expect a cost function to behave.

 

The expression C \equiv -\ln a^L_y may convey the idea, but as a function definition it is perhaps not formal. To be more rigorous, one can use the 'indicator function'

Indicator function

In mathematics, an indicator function or a characteristic function is a function defined on a set X that indicates membership of an element in a subset A of X, having the value 1 for all elements of A and the value 0 for all elements of X not in A. It is usually denoted by a symbol 1 or I, sometimes in boldface or blackboard boldface, with a subscript describing the set.

Definition

The indicator function of a subset A of a set X is a function

\mathbf{1}_A \colon X \to \{ 0,1 \} \,

defined as

\mathbf{1}_A(x) := \begin{cases} 1 &\text{if } x \in A, \\ 0 &\text{if } x \notin A. \end{cases}

The Iverson bracket allows the equivalent notation, [x\in A], to be used instead of \mathbf{1}_A(x).

The function \mathbf{1}_A is sometimes denoted I_A, \chi_A or even just A. (The Greek letter \chi appears because it is the initial letter of the Greek word characteristic.)

───

 

and write the total cost as

C_T \equiv \frac{1}{n} \sum \limits_x \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

. Here 1_{y_{\alpha}(x) = 1} means: if the training 'input sample' is x and the \alpha component of the corresponding 'correct output' vector \vec y equals 1, the indicator takes the value 1, and otherwise 0; n is the total number of samples. In the same way the meaning of the 'cost function'

C = C_x = \sum \limits_{\alpha = 0}^{9} 1_{y_{\alpha}(x) = 1}  \cdot \left( - \ln (a^L_{\alpha}) \right)

becomes explicit. Although C_x looks like a sum of ten terms - \ln (a^L_j), in fact for any one 'input sample' x only the single term indexed by \alpha survives.
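In code the indicator-function bookkeeping collapses to a single lookup (a sketch; a_L is the softmax output vector and y_onehot the desired output, both names mine):

import numpy as np

def log_likelihood_cost(a_L, y_onehot):
    # C_x = -ln a^L_alpha, where alpha is the single index with y_alpha = 1
    alpha = np.argmax(y_onehot)
    return -np.log(a_L[alpha])

print(log_likelihood_cost(np.array([0.1, 0.7, 0.2]), np.array([0, 1, 0])))  # -ln 0.7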

 

What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities \partial C / \partial w^L_{jk} and \partial C / \partial b^L_j. I won’t go through the derivation explicitly – I’ll ask you to do in the problems, below – but with a little algebra you can show that*

*Note that I’m abusing notation here, using y in a slightly different way to last paragraph. In the last paragraph we used y to denote the desired output from the network – e.g., output a “7” if an image of a 7 was input. But in the equations which follow I’m using y to denote the vector of output activations which corresponds to 7, that is, a vector which is all 0s, except for a 1 in the 7th location.

\frac{\partial C}{\partial b^L_j} = a^L_j - y_j \ \ \ \ (81)
\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j - y_j) \ \ \ \ (82)

 

To examine these equations, consult

Softmax function

Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

 \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \dots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).

See Multinomial logit for a probability model which uses the softmax activation function.

───

 

\delta_{ik}1_{y_{\alpha}(x) = 1} 之定義關係以及利用微分『鏈式法則』自可得乎 ??

These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67). It’s the same equation, albeit in the latter I’ve averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.

Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we’ll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we’ll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That’s not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Problems

  • Derive Equations (81) and (82).
  • Where does the “softmax” name come from? Suppose we change the softmax function so the output activations are given by
    a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}, \ \ \ \ (83)

    where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c \rightarrow \infty. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a “softened” version of the maximum function. This is the origin of the term “softmax”.

  • Backpropagation with softmax and the log-likelihood cost In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers. To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error \delta^L_j \equiv \partial C / \partial z^L_j in the final layer. Show that a suitable expression is:
    \delta^L_j = a^L_j -y_j. \ \ \ \ (84)

    Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.

 

One should likewise see why \delta^L_j = a^L_j - y_j is the right choice!! If one asks for the 'limiting value' of

a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}

c \rightarrow \infty 之『極限值』???假設\sum_k e^{c z^L_k} 裡, cz^L_{\alpha} 為最大。由於

\lim \limits_{c \to \infty} \frac{e^{c z^L_j}}{e^{c z^L_{\alpha}}} = 1 \ \ \ if  \ j =\alpha

\lim \limits_{c \to \infty} \frac{e^{c z^L_j}}{e^{c z^L_{\alpha}}} = 0 \ \ \ if  \ j \neq \alpha

it follows that in the limit the vector \vec{a}^{\,L} has its \alpha component equal to 1 and all the rest 0!!!
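A quick numerical check of this limit (the z values are arbitrary):

import numpy as np

z = np.array([1.0, 2.0, 3.0])
for c in (1, 10, 100):
    e = np.exp(c * z - np.max(c * z))   # stabilized softmax with sharpness c
    print(c, e / e.sum())               # approaches the one-hot vector [0, 0, 1]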

As for what a 'Likelihood function' is: saying too much would stray too far from the topic, so allow me to take refuge in the Wikipedia entry!

Likelihood function

數理統計學中,似然函數是一種關於統計模型中的參數函數,表示模型參數中的似然性。似然函數在統計推斷中有重大作用,如在最大似然估計費雪信息之中的應用等等。「似然性」與「或然性」或「機率」意思相近,都是指某種事件發生的可能性,但是在統計學中,「似然性」和「或然性」或「機率」又有明確的區分。機率用於在已知一些參數的情況下,預測接下來的觀測所得到的結果,而似然性則是用於在已知某些觀測所得到的結果時,對有關事物的性質的參數進行估計。

In this sense, the likelihood function can be understood as the reverse of a conditional probability. When a parameter B is known, the probability that event A occurs is written:

P(A \mid B) = \frac{P(A , B)}{P(B)} \!

Using Bayes' theorem,

P(B \mid A) = \frac{P(A \mid B)\;P(B)}{P(A)} \!

we can therefore turn this around to construct a representation of likelihood: knowing that event A has occurred, we use the likelihood function \mathbb{L}(B \mid A) to estimate the plausibility of the parameter B. Formally, a likelihood function is also a kind of conditional probability function, but the variable we attend to has changed:

b\mapsto P(A \mid B=b) \!

Note that the likelihood function is not required to be normalized: \sum_{b \in \mathcal{B}}P(A \mid B=b) = 1 need not hold. A likelihood function multiplied by a positive constant is still a likelihood function; for any \alpha > 0 one may take as a likelihood function:

L(b \mid A) = \alpha \; P(A \mid B=b) \!

Example

Consider the experiment of tossing a coin. Ordinarily, knowing that the probabilities of the coin landing heads and tails are each p_H = 0.5, one can compute the probability of the various outcomes of a number of tosses. For instance, the probability of two heads in a row is 0.25. In conditional-probability notation:

P(\mbox{HH} \mid p_H = 0.5) = 0.5^2 = 0.25

where H denotes heads.

In statistics, what concerns us is the information about the probability of heads contained in a known sequence of tosses.
We can set up a statistical model: assume that a toss lands heads with probability p_H and tails with probability 1 - p_H.
The conditional probability can then be rewritten as a likelihood function:

L(p_H = 0.5 \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = 0.5) =0.25

That is, for this likelihood function, when two heads in a row are observed, the likelihood of p_H = 0.5 is 0.25 (which does not mean that, given two observed heads, the probability that p_H = 0.5 is 0.25).


The likelihood function after two heads in a row

If instead we consider p_H = 0.6, the value of the likelihood function changes.

L(p_H = 0.6 \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = 0.6) =0.36

Note that the value of the likelihood has increased. This says that, had the parameter p_H been 0.6, observing two heads in a row would have been more probable than under the assumption p_H = 0.5; that is, the value 0.6 for p_H is more persuasive, more 'reasonable', than 0.5. In short, what matters about a likelihood function is not its particular value but whether it grows or shrinks as the parameter varies. For a given likelihood function, if some parameter value makes it attain its maximum, that value is the most 'reasonable' parameter value.

In this example the likelihood function is in fact:

L(p_H = \theta \mid \mbox{HH}) = P(\mbox{HH}\mid p_H = \theta) = \theta^2, where 0 \le p_H \le 1.

Taking p_H = 1 makes the likelihood attain its maximum value 1. That is, when two heads in a row are observed, the assumption that the coin always lands heads is the most 'reasonable'.

Similarly, if the observation is three tosses, the first two heads and the third tails, the likelihood function will be:

L(p_H = \theta \mid \mbox{HHT}) = P(\mbox{HHT}\mid p_H = \theta) = \theta^2(1 - \theta), where T denotes tails and 0 \le p_H \le 1.

This time the likelihood attains its maximum at p_H = \frac{2}{3}. That is, when three tosses show two heads followed by a tail, the estimate p_H = \frac{2}{3} for the probability of heads is the most 'reasonable'.
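That maximizer can be confirmed symbolically (a brief sketch):

import sympy as sp

theta = sp.symbols('theta', positive=True)
L = theta**2 * (1 - theta)                   # likelihood of observing H, H, T
print(sp.solve(sp.diff(L, theta), theta))    # [2/3], the maximum-likelihood estimate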


The likelihood function after three tosses: two heads, then tails

───