W!o+'s "Romance of the Little Weasel Workshop" (《小伶鼬工坊演義》): Neural Networks, "Turning Point" Part 2

After the "elephant" story, it seems only natural that Michael Nielsen goes on to discuss the phenomenon of overfitting:

Let’s sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We’ll use our 30 hidden neuron network, with its 23,860 parameters. But we won’t train the network using all 50,000 MNIST training images. Instead, we’ll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We’ll train in a similar way to before, using the cross-entropy cost function, with a learning rate of \eta = 0.5 and a mini-batch size of 10. However, we’ll train for 400 epochs, a somewhat larger number than before, because we’re not using as many training examples. Let’s use network2 to look at the way the cost function changes:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
...     mnist_loader.load_data_wrapper()
>>> import network2
>>> # 784 input neurons, 30 hidden neurons, 10 output neurons
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> # First 1,000 training images only: 400 epochs, mini-batch size 10, eta = 0.5
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
...     monitor_evaluation_accuracy=True, monitor_training_cost=True)
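As a quick sanity check on the 23,860 parameters quoted above, the count for a fully connected network follows directly from its layer sizes: one weight per input-output pair between adjacent layers, plus one bias per non-input neuron. The sketch below is only an illustration; the helper name `count_parameters` is hypothetical and is not part of network2.

```python
def count_parameters(sizes):
    """Count weights and biases in a fully connected feedforward network.

    `sizes` lists the number of neurons per layer, input layer first.
    The input layer has no weights or biases of its own.
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases


# 784*30 + 30*10 weights, plus 30 + 10 biases
total = count_parameters([784, 30, 10])
print(total)  # 23860
```

With only 1,000 training images, that is roughly 24 adjustable parameters per training example, which is why the restricted training set makes the generalization problem so visible.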

Using the results we can plot the way the cost changes as the network learns:

This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I’ve only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we’ll see, turns out to be where the interesting action is.

Let’s now look at how the classification accuracy on the test data changes over time:

Again, I’ve zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting “better”. But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it’s not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
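The observation that accuracy "pretty much stops improving" can be turned into a simple stopping rule. The following is a minimal sketch of a patience-based rule, a common heuristic and an assumption here rather than anything Nielsen's code does at this point. The function name, the `patience` and `min_delta` parameters, and the accuracy history are all made up for illustration.

```python
def early_stopping_epoch(accuracies, patience=10, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a per-epoch accuracy history.

    Training is considered finished at the first epoch where `patience`
    epochs have passed since the best accuracy last improved by more
    than `min_delta`. If that never happens, stop_epoch is the last
    epoch in the history. Assumes a non-empty history.
    """
    best_epoch, best_acc = 0, accuracies[0]
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc + min_delta:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(accuracies) - 1


# Made-up history: accuracy climbs for six epochs, then only jitters.
history = [0.70, 0.75, 0.78, 0.80, 0.81, 0.82,
           0.819, 0.82, 0.818, 0.82, 0.819]
best, stop = early_stopping_epoch(history, patience=3)
print(best, stop)  # 5 8
```

Note that the rule watches accuracy on held-out data, not the training cost: in the run described above, the training cost would never trigger it.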




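Occam's razor (also written Ockham's razor), known in Latin as lex parsimoniae, "the law of parsimony", is a problem-solving principle proposed by William of Ockham (c. 1287-1347; Ockham is in Surrey, England), a 14th-century logician and Franciscan friar. In his commentary on the Sentences (Book 2, Question 15) he wrote: "Do not waste more things on doing what can be done equally well with fewer." Put another way, if several theories about the same problem all make equally accurate predictions, one should choose the theory that uses the fewest assumptions. More complex methods can often make better predictions, but when predictive power is not what separates the candidates, the fewer assumptions the better.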








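There have also been other attempts to derive Occam's razor from probability theory, including notable attempts by Harold Jeffreys and Edwin Thompson Jaynes. The (Bayesian) probabilistic basis for Occam's razor is given by David MacKay in chapter 28 of his book Information Theory, Inference, and Learning Algorithms,[30] where he emphasizes that simpler models do not need to be given a higher prior preference in advance.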

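William Jefferys (no relation to Harold Jeffreys) and James Berger (1991) generalize and quantify the concept of "assumptions" in the original formulation of the razor as the degree to which a proposition is unnecessarily accommodating to possible observable data.[31] They state: "A hypothesis with fewer adjustable parameters will automatically have an enhanced posterior probability, due to the fact that the predictions it makes are sharp."[31] The model they propose balances the accuracy of a theory's predictions against their sharpness: a theory that sharply makes correct predictions is preferred over one that allows a wide range of outcomes or one that is incorrect. This again reflects the relationship between the core concepts of Bayesian inference (marginal probability, conditional probability, and posterior probability).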
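The claim that a sharper hypothesis earns a higher posterior without any prior bias toward simplicity can be seen in a toy calculation. The sketch below is an illustrative assumption, not an example taken from MacKay or from Jefferys and Berger: two hypotheses receive equal priors, one spreads its predictions over ten outcomes and the other over a hundred, and both are compatible with the observation.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses.

    priors[h] is P(h); likelihoods[h] is P(data | h).
    Returns P(h | data) for every hypothesis h.
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Toy hypotheses (illustrative): the observed integer was drawn uniformly
# from 1..10 ("sharp") or from 1..100 ("vague"). Both can explain a 7.
observed = 7
likelihoods = {
    "sharp": Fraction(1, 10) if 1 <= observed <= 10 else Fraction(0),
    "vague": Fraction(1, 100) if 1 <= observed <= 100 else Fraction(0),
}
priors = {"sharp": Fraction(1, 2), "vague": Fraction(1, 2)}  # no prior bias

post = posterior(priors, likelihoods)
print(post["sharp"], post["vague"])  # 10/11 1/11
```

The vague hypothesis is penalized only because it spreads its probability over many outcomes that did not occur; the priors are identical.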






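In statistics, overfitting refers to using too many parameters when fitting a statistical model. Relative to the amount of data available, even an absurd model can fit the data perfectly as long as it is complex enough. Overfitting can generally be regarded as a violation of Occam's razor. When the degrees of freedom of the adjustable parameters exceed the information content of the data, the fitted model ends up using arbitrary parameters, which reduces or destroys its ability to generalize beyond the data it was fitted to. The likelihood of overfitting depends not only on the number of parameters and the amount of data, but also on how well the model structure matches the data, and on the size of the model error compared with the level of noise or error expected in the data.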

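The concept of overfitting is also important in machine learning. A learning algorithm is usually trained on a set of training examples, that is, examples for which the desired output is known. The learner is then expected to predict the correct output for other examples as well, so it should generalize to situations beyond the data used in training (according to its inductive bias). However, the learner may instead adapt to overly specific yet random features of the training data, especially when learning goes on too long or training examples are scarce. During overfitting, performance on the training examples keeps improving while performance on unseen data gets worse.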

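In both statistics and machine learning, extra techniques are needed to avoid overfitting (for example cross-validation, early stopping, the Bayesian information criterion, the Akaike information criterion, or model comparison). These can indicate when further training no longer leads to better generalization. In artificial neural networks, overfitting is also known as overtraining. In treatment learning, overfitting is avoided by using a minimum best support value.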
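To make two of the listed techniques concrete, here is a minimal sketch of the Akaike and Bayesian information criteria in the form commonly used for least-squares fits with Gaussian errors, up to an additive constant. The two candidate models and their residual sums of squares are made-up numbers, and conventions differ on whether the noise variance is counted in `k`, so treat this as an illustration only.

```python
import math


def aic(n, rss, k):
    """Akaike information criterion for a least-squares fit with Gaussian
    errors, up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def bic(n, rss, k):
    """Bayesian information criterion under the same assumptions:
    n * ln(RSS / n) + k * ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical fits to the same n = 20 data points.
n = 20
simple = {"k": 2, "rss": 1.0}    # e.g. intercept and slope
complex_ = {"k": 6, "rss": 0.8}  # four extra parameters, slightly lower RSS

for name, m in (("simple", simple), ("complex", complex_)):
    print(name,
          round(aic(n, m["rss"], m["k"]), 2),
          round(bic(n, m["rss"], m["k"]), 2))
```

Lower is better for both criteria. Here the small drop in residual error does not pay for the four extra parameters, so both criteria prefer the simple model.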




Noisy (roughly linear) data is fitted to both linear and polynomial functions. Although the polynomial function is a perfect fit, the linear version can be expected to generalize better. In other words, if the two functions were used to extrapolate the data beyond the fit data, the linear function would make better predictions.
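The situation in this caption can be reproduced in a few lines of plain Python. The sketch assumes an underlying relationship y = 2x + 1 and a fixed, made-up noise vector, and uses Lagrange interpolation as the "perfect fit" polynomial; none of these choices comes from the original figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a function x -> prediction."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange form of the unique degree-(n-1) polynomial through n points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def truth(x):
    return 2 * x + 1  # assumed underlying linear relationship


# Fixed, made-up noise so the example is reproducible.
noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.2]
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [truth(x) + e for x, e in zip(xs, noise)]

line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

train_err_line = max(abs(line(x) - y) for x, y in zip(xs, ys))
train_err_poly = max(abs(poly(x) - y) for x, y in zip(xs, ys))
new_x = 7.0  # outside the fitted range
extrap_err_line = abs(line(new_x) - truth(new_x))
extrap_err_poly = abs(poly(new_x) - truth(new_x))

print("max training error:", train_err_line, train_err_poly)
print("error at x = 7:    ", extrap_err_line, extrap_err_poly)
```

The polynomial passes through every training point, yet at x = 7 it is off by more than a hundred, while the line stays within a fraction of a unit of the assumed truth.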


Machine learning

Usually a learning algorithm is trained using some set of “training data”: exemplary situations for which the desired output is known. The goal is that the algorithm will also perform well on predicting the output when fed “validation data” that was not encountered during its training.

Overfitting is the use of models or procedures that violate Occam’s razor, for example by including more adjustable parameters than are ultimately optimal, or by using a more complicated approach than is ultimately optimal. For an example where there are too many adjustable parameters, consider a dataset where training data for y can be adequately predicted by a linear function of two independent variables. Such a function requires only three parameters (the intercept and two slopes). Replacing this simple function with a new, more complex quadratic function, or with a new, more complex linear function on more than two independent variables, carries a risk: Occam’s razor implies that any given complex function is a priori less probable than any given simple function. If the new, more complicated function is selected instead of the simple function, and if there was not a large enough gain in training-data fit to offset the complexity increase, then the new complex function “overfits” the data, and the complex overfitted function will likely perform worse than the simpler function on validation data outside the training dataset, even though the complex function performed as well, or perhaps even better, on the training dataset.[2]
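The parameter counts in this example are easy to tabulate. A full polynomial of total degree d in p predictor variables has one coefficient per monomial of degree at most d, which gives C(p + d, d) coefficients. The helper below is a hypothetical illustration of that standard counting formula.

```python
from math import comb


def polynomial_parameter_count(num_variables, degree):
    """Number of coefficients in a full polynomial of the given total degree:
    one per monomial of degree <= `degree`, i.e. C(num_variables + degree, degree)."""
    return comb(num_variables + degree, degree)


print(polynomial_parameter_count(2, 1))  # 3: intercept and two slopes
print(polynomial_parameter_count(2, 2))  # 6: adds x1^2, x2^2 and x1*x2
print(polynomial_parameter_count(5, 1))  # 6: a linear model on five variables
```

Both of the "more complex" alternatives mentioned in the paragraph double the number of adjustable parameters in this small case.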

When comparing different types of models, complexity cannot be measured solely by counting how many parameters exist in each model; the expressivity of each parameter must be considered as well. For example, it is nontrivial to directly compare the complexity of a neural net (which can track curvilinear relationships) with m parameters to a regression model with n parameters.[2]

Overfitting is especially likely in cases where learning was performed too long or where training examples are rare, causing the learner to adjust to very specific random features of the training data that have no causal relation to the target function. In this process of overfitting, the performance on the training examples still increases while the performance on unseen data becomes worse.

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.
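The retail example can be written out directly. In the sketch below the purchase records are made up, and the "model" is nothing more than a dictionary keyed by timestamp. It is compared against a baseline that ignores the timestamp and always predicts the most common item.

```python
from collections import Counter


def fit_timestamp_lookup(rows):
    """'Model' that memorizes which item was bought at each timestamp."""
    return {timestamp: item for timestamp, item in rows}


def accuracy(predict, rows):
    """Fraction of rows whose item is predicted correctly."""
    return sum(predict(t) == item for t, item in rows) / len(rows)


# Made-up purchase records: (timestamp, item bought).
train = [("2016-05-01 09:15", "milk"), ("2016-05-01 12:40", "bread"),
         ("2016-05-02 18:05", "milk"), ("2016-05-03 08:30", "eggs")]
test = [("2016-05-04 09:15", "milk"), ("2016-05-05 12:40", "bread")]

lookup = fit_timestamp_lookup(train)
most_common_item = Counter(item for _, item in train).most_common(1)[0][0]


def memorizer(timestamp):
    return lookup.get(timestamp)  # None for timestamps never seen in training


def baseline(timestamp):
    return most_common_item  # ignores the timestamp entirely


print(accuracy(memorizer, train), accuracy(memorizer, test))  # 1.0 0.0
print(accuracy(baseline, train), accuracy(baseline, test))    # 0.5 0.5
```

The memorizer is perfect in hindsight and useless in foresight; the cruder baseline fits the training set worse but is the better predictor on new purchases.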

Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from the fact that information from all past experience can be divided into two groups: information that is relevant for the future and irrelevant information (“noise”). Everything else being equal, the more difficult a criterion is to predict (i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore. A learning algorithm that can reduce the chance of fitting noise is called robust.


Overfitting/overtraining in supervised learning (e.g., neural network). Training error is shown in blue, validation error in red, both as a function of the number of training cycles. If the validation error increases (positive slope) while the training error steadily decreases (negative slope), then overfitting may have occurred. The best predictive and fitted model is the one where the validation error reaches its global minimum.
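Given recorded error curves, both statements in this caption can be checked after the fact. The sketch below uses made-up error values for eight training cycles; the function names are hypothetical.

```python
def best_epoch(validation_errors):
    """Index of the global minimum of the validation error curve."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


def overfitting_epochs(training_errors, validation_errors):
    """Epochs where, compared with the previous epoch, training error fell
    while validation error rose."""
    return [e for e in range(1, len(training_errors))
            if training_errors[e] < training_errors[e - 1]
            and validation_errors[e] > validation_errors[e - 1]]


# Made-up error curves for eight training cycles.
training = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
validation = [0.95, 0.70, 0.55, 0.50, 0.48, 0.52, 0.57, 0.63]

print(best_epoch(validation))                    # 4
print(overfitting_epochs(training, validation))  # [5, 6, 7]
```

In practice the curves are noisy, so a single epoch of rising validation error is weak evidence; the global minimum is the more robust of the two signals.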



