W!o+'s 《小伶鼬工坊演義》: Neural Networks [Random Variables] III

Reading this passage by Michael Nielsen, it feels as though it were dashed off effortlessly, like a casual essay:

Let’s compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we’ll use 30 hidden neurons, a mini-batch size of 10, a regularization parameter \lambda = 5.0, and the cross-entropy cost function. We will decrease the learning rate slightly from \eta = 0.5 to 0.1, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data, 
... monitor_evaluation_accuracy=True)

We can also train using the new approach to weight initialization. This is actually even easier, since network2's default way of initializing the weights is using this new approach. That means we can omit the net.large_weight_initializer() call above:

>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data, 
... monitor_evaluation_accuracy=True)

Plotting the results*, we obtain:

*The program used to generate this and the next graph is weight_initialization.py.

[Figure: classification accuracy on the validation data per epoch, old vs. new weight initialization, 30 hidden neurons.]

In both cases, we end up with a classification accuracy somewhat over 96 percent. The final classification accuracy is almost exactly the same in the two cases. But the new initialization technique brings us there much, much faster. At the end of the first epoch of training, the old approach to weight initialization has a classification accuracy under 87 percent, while the new approach is already almost 93 percent. What appears to be going on is that our new approach to weight initialization starts us off in a much better regime, which lets us get good results much more quickly. The same phenomenon is also seen if we plot results with 100 hidden neurons:

[Figure: classification accuracy on the validation data per epoch, old vs. new weight initialization, 100 hidden neurons.]

In this case, the two curves don’t quite meet. However, my experiments suggest that with just a few more epochs of training (not shown) the accuracies become almost exactly the same. So on the basis of these experiments, it looks as though the improved weight initialization only speeds up learning; it doesn’t change the final performance of our networks. However, in Chapter 4 we’ll see examples of neural networks where the long-run behaviour is significantly better with the 1/\sqrt{n_{\rm in}} weight initialization. Thus it’s not only the speed of learning which is improved; it’s sometimes also the final performance.

The 1/\sqrt{n_{\rm in}} approach to weight initialization helps improve the way our neural nets learn. Other techniques for weight initialization have also been proposed, many building on this basic idea. I won’t review the other approaches here, since 1/\sqrt{n_{\rm in}} works well enough for our purposes. If you’re interested in looking further, I recommend looking at the discussion on pages 14 and 15 of a 2012 paper by Yoshua Bengio*

*Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio (2012).

, as well as the references therein.
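
To make the difference concrete, here is a minimal NumPy sketch of the two schemes, in the spirit of network2's large_weight_initializer and its default initializer. The function names below are illustrative, not the book's code, and biases are left out for brevity: the old scheme draws every weight from a standard Gaussian, while the new one divides by \sqrt{n_{\rm in}}, the square root of the number of inputs to a neuron.

import numpy as np

def large_weight_init(sizes):
    # Old scheme: every weight drawn from N(0, 1), regardless of fan-in.
    return [np.random.randn(y, x)
            for x, y in zip(sizes[:-1], sizes[1:])]

def scaled_weight_init(sizes):
    # New scheme: standard deviation 1/sqrt(n_in), where n_in = x is
    # the number of inputs feeding the layer.
    return [np.random.randn(y, x) / np.sqrt(x)
            for x, y in zip(sizes[:-1], sizes[1:])]

# For a [784, 30, 10] network, the weighted input z of a first-layer
# hidden neuron sums 784 terms; unit-variance weights make |z| large
# and saturate the sigmoid, while the scaled scheme keeps z near 1.
sizes = [784, 30, 10]
x = np.full((784, 1), 0.5)   # a crude stand-in for an MNIST input
for init in (large_weight_init, scaled_weight_init):
    w = init(sizes)[0]       # first-layer weights, shape (30, 784)
    print(init.__name__, "std of z =", float((w @ x).std()))

With unit-variance weights the standard deviation of z comes out near \sqrt{784 \times 0.25} = 14, deep in the sigmoid's saturated region; the scaled scheme gives about 0.5.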

───

 

In truth, 'gauging' a complex phenomenon takes a great deal of 'experience' and deep 'scholarship'; it is by no means easy. Let me add just one story:

Do not imagine that making a reasonable 'estimate' is a simple matter. As a well-known account relates:

Fermi was widely known for his ability to obtain good estimates from very scant or imprecise data. One example comes from the Manhattan Project, which he helped lead: estimating the 'yield' of a nuclear explosion. In the early morning of July 16, 1945, when the atomic bomb was successfully detonated in the New Mexico desert, Fermi was near the test site. He suddenly leapt up and scattered a handful of paper scraps into the air; as the blast wave swept the scraps away, he chased them a few steps and, from the distance they had been carried, estimated the 'yield' of the explosion. Fermi put the blast at the equivalent of ten thousand tons of TNT, quite close to the now-accepted figure of twenty thousand tons, an error of less than an order of magnitude.

───

What this account describes is precisely an example of the class known as 'Fermi problems'. If you can master the 'classic example' said to come from Fermi himself:

The classic Fermi problem, generally attributed to Fermi,[2] is “How many piano tuners are there in Chicago?” A typical solution to this problem involves multiplying a series of estimates that yield the correct answer if the estimates are correct. For example, we might make the following assumptions:

  1. There are approximately 9,000,000 people living in Chicago.
  2. On average, there are two persons in each household in Chicago.
  3. Roughly one household in twenty has a piano that is tuned regularly.
  4. Pianos that are tuned regularly are tuned on average about once per year.
  5. It takes a piano tuner about two hours to tune a piano, including travel time.
  6. Each piano tuner works eight hours in a day, five days in a week, and 50 weeks in a year.

From these assumptions, we can compute that the number of piano tunings in a single year in Chicago is

(9,000,000 persons in Chicago) / (2 persons/household) × (1 piano/20 households) × (1 piano tuning per piano per year) = 225,000 piano tunings per year in Chicago.

We can similarly calculate that the average piano tuner performs

(50 weeks/year) × (5 days/week) × (8 hours/day) / (2 hours to tune a piano) = 1000 piano tunings per year per piano tuner.

Dividing gives

(225,000 piano tunings per year in Chicago) / (1000 piano tunings per year per piano tuner) = 225 piano tuners in Chicago.

The actual number of piano tuners in Chicago is about 290.[3]
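
The chain of estimates is short enough to check with a few lines of Python; every number below is simply one of the six assumptions above, not measured data:

# Fermi estimate of the number of piano tuners in Chicago,
# built directly from the six assumptions listed above.
population = 9_000_000        # 1. people living in Chicago
persons_per_household = 2     # 2. average household size
households_per_piano = 20     # 3. one regularly tuned piano per 20 households
tunings_per_piano = 1         # 4. tunings per piano per year
hours_per_tuning = 2          # 5. hours to tune one piano, travel included
hours_per_year = 8 * 5 * 50   # 6. one tuner's working hours per year

tunings_needed = (population / persons_per_household
                  / households_per_piano * tunings_per_piano)  # 225,000
tunings_per_tuner = hours_per_year / hours_per_tuning          # 1000
print(round(tunings_needed / tunings_per_tuner), "piano tuners")   # -> 225

The usual rationale for such estimates is that the individual over- and under-estimates tend to cancel in the product, which is why the result lands within the right order of magnitude of the actual figure of about 290.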

Then, from just a few 'data points' in Taiwan's historical and geographical records, could you likewise 'estimate' how many trees once grew here before the mountain forests were 'developed', and how much 'primeval forest' remains today? Perhaps, at this time of 'glorious National Day', giving a bit more attention to 'soil and water conservation' is itself an act of love for the 'treasure island', and might spare us some 'typhoon damage' in the future!

─── Excerpted from 《勇闖新世界︰ W!o《卡夫卡村》變形祭︰感知自然‧數據分析‧二》