W!o+'s 《小伶鼬工坊演義》: Neural Networks [hyper-parameters] (I)

In this passage Mr. Michael Nielsen discusses 'hyper-parameters':

Let’s rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you’re running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it’s wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, \eta. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we’d chosen the learning rate to be \eta =0.001,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to \eta =0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we’ll end up with a learning rate of something like \eta = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
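(An aside, not part of the quoted text: that trial-and-error loop is easy to script. Below is a minimal sketch of our own, assuming the book's network.py and mnist_loader.py modules are importable as in the earlier examples; it simply trains a fresh network at a few candidate learning rates so the trends can be compared.)

# Our own rough learning-rate sweep (a sketch, assuming the book's
# network.py and mnist_loader.py are available on the path).
import mnist_loader
import network

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

for eta in [0.001, 0.01, 0.1, 1.0, 3.0]:
    print("--- learning rate = %s" % eta)
    net = network.Network([784, 30, 10])
    # A handful of epochs per setting is enough to see the trend;
    # the full experiments above use 30 epochs.
    net.SGD(training_data, 5, 10, eta, test_data=test_data)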

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to \eta =100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we’ve actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn’t be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we’ve initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don’t have enough training data to get meaningful learning? Perhaps we haven’t run for enough epochs? Or maybe it’s impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you’re coming to a problem for the first time, you’re not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We’ll discuss all these at length through the book, including how I chose the hyper-parameters above.

───

 

What, then, is a 'hyper-parameter'? As the Wikipedia entry puts it:

Hyperparameter

In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis.

For example, if one is using a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:

  • p is a parameter of the underlying system (Bernoulli distribution), and
  • α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.

One may take a single value for a given hyperparameter, or one can iterate and take a probability distribution on the hyperparameter itself, called a hyperprior.
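(To make the quoted example concrete, here is a small sketch of our own, not part of the entry: with a Beta(α, β) prior on the Bernoulli parameter p, the hyperparameters α and β are chosen by us, and observing data only updates them through the usual conjugate rule.)

# Our own illustration of the quoted example: alpha and beta are
# hyperparameters of a Beta prior on the Bernoulli parameter p.
alpha, beta = 2.0, 2.0            # chosen beforehand, not learnt from data
flips = [1, 0, 1, 1, 0, 1]        # observed Bernoulli trials: 1 = success

heads = sum(flips)
tails = len(flips) - heads

# The Beta prior is conjugate to the Bernoulli likelihood, so the
# posterior is again a Beta distribution with updated (hyper)parameters.
post_alpha = alpha + heads
post_beta = beta + tails

# Posterior mean estimate of p.
print("posterior Beta(%.1f, %.1f), mean p = %.3f"
      % (post_alpha, post_beta, post_alpha / (post_alpha + post_beta)))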

───

 

It seems to carry something 'a priori'; is that what is at issue here?? Yet from Mr. Michael Nielsen's account it sounds more like something 'a posteriori'!! Might we not, then, take 'Stochastic gradient descent' itself and draw some 'inspiration' from it??!!

The modern Western tradition divides knowledge into two kinds by whether it comes 'before' or 'after' experience: 'a priori knowledge', which needs no experience at all, and 'a posteriori knowledge', which arises only from some experience. Rationalists generally believe a priori knowledge exists, while empiricists hold that even if it does exist, it matters little beside the great mass of a posteriori knowledge. The world of probability uses the same terms. For example, if we infer that a die is 'fair', the 'prior probability' of each face is \frac{1}{6}; for a fair coin, heads and tails each have probability \frac{1}{2}. A 'posterior probability', by contrast, is a 'conditional probability' assigned only after the relevant 'evidence' or 'background' has been taken into account. By the definition of conditional probability, the probability that event A occurs given that event C has occurred is:

P(A|C)=\frac{P(A \cap C)}{P(C)}

For example, if two dice are thrown and their sum is known to be 6, the probability that one of the dice shows a 2 is \frac{\frac{2}{36}}{\frac{5}{36}} = \frac{2}{5}.
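That figure of \frac{2}{5} is easy to check by brute-force enumeration; a small sketch of our own:

# Enumerate all 36 equally likely outcomes of two dice to verify the
# conditional probability worked out above (our own check).
from fractions import Fraction

outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
C = [o for o in outcomes if o[0] + o[1] == 6]   # condition: the sum is 6
A_and_C = [o for o in C if 2 in o]              # ...and some die shows 2

print(Fraction(len(A_and_C), len(outcomes)))    # P(A and C) = 2/36 = 1/18
print(Fraction(len(C), len(outcomes)))          # P(C)       = 5/36
print(Fraction(len(A_and_C), len(C)))           # P(A | C)   = 2/5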

In 1990, Professor Arnold Zuboff of the University of London published a paper written in 1986, "One Self: The Logic of Experience", which posed the 'Sleeping Beauty problem'.

[Images: Brewtnall, "Sleeping Beauty"; "Dornröschen"]

Sleeping Beauty, fully informed of the details, volunteers for the following experiment:

On Sunday she is put to sleep. In the course of the experiment she will be awakened once or twice, and afterwards given an amnesia-inducing drug, so she will not remember having been awakened. A fair coin is tossed to decide which procedure the experiment follows:

If the coin comes up 'heads', she is awakened and interviewed only on 'Monday'.
If the coin comes up 'tails', she is awakened and interviewed on both 'Monday' and 'Tuesday'.

In either case she is finally awakened on 'Wednesday', and the experiment ends without an interview. Each time she is awakened and interviewed she is asked: what is your 'degree of belief' now that 'the coin came up heads'?

The question is still debated today: 'Thirders' answer \frac{1}{3}, 'Halfers' answer \frac{1}{2}. Can Sleeping Beauty really have one 'correct answer'? A coin tossed just once, with its two outcomes, leads to interviews on possibly one day or two; how should the 'prior' and 'posterior' readings of 'probability' be thought through here? Ordinary probability theory measures by the 'relative frequency of occurrence' over 'all situations that can arise', the sample space; where no such measure is available, one may fall back on 'indifference', that is 'indistinguishability', and assume the relative frequencies are all 'the same'. The 'sample space' and the 'assumption of measure' are thus the very source of the dispute. Suppose we take the coin-outcome set {heads, tails} and the interview-day set {Monday, Tuesday}, and look at the event probabilities of this problem from the standpoint of a fair coin:

Probability [heads, Monday] = \frac{1}{2}
Probability [heads, Tuesday] = 0
Probability [tails, Monday] = \frac{1}{4}
Probability [tails, Tuesday] = \frac{1}{4}

This 'Probability [heads, Tuesday] = 0' is the main focus of the dispute, because it is an 'impossible' event. From the standpoint of sampling empirical events, perhaps it should simply be dropped when the 'sample space' is drawn up; but then why shouldn't such an 'observer' assume that 'every event that can occur' has the same 'probability', namely \frac{1}{3}??
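(An aside of our own, not part of the excerpt: the two tallies behind the dispute can at least be made concrete. Counting 'heads' per experiment gives roughly one half, counting it per awakening gives roughly one third, and which count answers the interview question is exactly what Halfers and Thirders disagree about. A small simulation sketch:)

# Our own simulation sketch: heads -> one awakening (Monday),
# tails -> two awakenings (Monday and Tuesday).
import random

trials = 100000
heads_trials = 0        # "per experiment" tally (Halfer reading)
heads_awakenings = 0    # "per awakening" tally (Thirder reading)
total_awakenings = 0

for _ in range(trials):
    heads = random.random() < 0.5
    awakenings = 1 if heads else 2
    total_awakenings += awakenings
    if heads:
        heads_trials += 1
        heads_awakenings += 1      # heads is only ever seen on Monday

print("heads per experiment: %.3f" % (float(heads_trials) / trials))
print("heads per awakening : %.3f" % (float(heads_awakenings) / total_awakenings))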

─── Excerpted from 《改不改??變不變!!》

 

In the blink of an eye it is 'Labour Day' again, and 'justice' is still struggling:

If one asks what the 'meaning' of a 'word' is, then according to the 《教育大辭書》 (Dictionary of Education):

Definition: the import expressed by a character, word, sentence, or symbol is its 'meaning'.

As to kinds of meaning, a common division is into 'factual meaning' and 'emotive meaning'; for instance, of the two statements 「那個人」 ('that person') and 「他那個人」 ('that person of his'), the former expresses factual meaning while the latter carries emotive meaning as well. Other philosophers, using different criteria, have offered other classifications.

Of this 'emotive meaning' the same dictionary further notes:

Emotive meaning means that a speaker's words carry some feeling or attitude, or that the speaker intends to display a feeling or attitude; it is roughly the other face of the saying 'the feeling shows through the words', the feeling seen through the meaning of the words. Emotive meaning traces back to the 'Vienna Circle' and its successors, who wanted to establish a 'verifiable' criterion for the 'meaningfulness' of sentences. Some philosophers tried to approach this through the analysis of 'morality' or of 'poetry', yet these are hard to pass through any such 'criterion of meaning': whether, say, pleasure's being good ('I am pleased') has any 'objective' 'truth' is itself a problem. Likewise, when Confucius said, 'It passes on just like this, never ceasing day or night!', one reading is of something actually seen, the other has 'the feeling within it'; whether Confucius was merely confirming a 'fact' from 'experience' or lamenting the flow of time is also hard to verify by any criterion.

Suppose we say that

the West's 'speculative reason' excels at 'critique' and is loath to 'keep company with contradiction', while

the East's 'philosophy of generative life' is good at 'analogy' and may delight in 'softening its light and mingling with the dust'.

If both are 『』, where does this 'difference' come from?

People always have 『』, 『』, 『』, and 『』; sometimes

『』 and 『』 cannot be 『』, and a moment later

『』 and 『』 deliberately 『』. What then should one say?? Perhaps it all

stems from humans having 'self-awareness', forming a 'self-image', and being able to 'distance themselves from themselves', hence able to 'deceive themselves and others'; or from being driven by 'environment and society'; or from failing to 'know themselves'. The long 'ingrained habits' of 'human nature' (a cultural DNA??) are, I fear, something □○ 'cannot quite put into words'!

─── Excerpted from 《字詞網絡︰ WordNet 《二》 勞動離正義多遠?!》