W!o+'s 《小伶鼬工坊演義》: Neural Networks [Deep Learning] IV, Part 2

Qu Yuan,《天問》(Heavenly Questions)

曰遂古之初,誰傳道之?
上下未形,何由考之?
冥昭瞢闇,誰能極之?
馮翼惟像,何以識之?
明明闇闇,惟時何爲?
陰陽三合,何本何化?
圜則九重,孰營度之?
惟茲何功,孰初作之?
斡維焉系,天極焉加?
八柱何當,東南何虧?
九天之際,安放安屬?
隅隈多有,誰知其數?
天何所遝?十二焉分?
日月安屬?列星安陳?
出自湯穀,次於蒙泛。
自明及晦,所行幾里?
夜光何德,死則又育?
厥利維何,而顧兔在腹?
女歧無合,夫焉取九子?
伯強何處?惠氣安在?
何闔而晦?何開而明?
角宿未旦,曜靈安藏?
不任汩鴻,師何以尚之?
僉曰“何憂”,何不課而行之?
鴟龜曳銜,鯀何聽焉?
順欲成功,帝何刑焉?
永遏在羽山,夫何三年不施?
伯禹愎鯀,夫何以變化?
纂就前緒,遂成考功。
何續初繼業,而厥謀不同?
洪泉極深,何以窴之?
地方九則,何以墳之?
河海應龍?何盡何曆?
鯀何所營?禹何所成?
康回馮怒,墜何故以東南傾?
九州安錯?川穀何洿?
東流不溢,孰知其故?
東西南北,其修孰多?
南北順墮,其衍幾何?
昆崙縣圃,其尻安在?
增城九重,其高幾里?
四方之門,其誰從焉?
西北辟啟,何氣通焉?
日安不到?燭龍何照?
羲和之未颺,若華何光?
何所冬暖?何所夏寒?
焉有石林?何獸能言?
焉有虯龍、負熊以游?
雄虺九首,鯈忽焉在?
何所不死?長人何守?
靡蓱九衢,枲華安居?
靈蛇吞象,厥大何如?
黑水、玄趾,三危安在?
延年不死,壽何所止?
鯪魚何所?鬿堆焉處?
羿焉彃日?烏焉解羽?
禹之力獻功,降省下土四方。
焉得彼嵞山女,而通之於台桑?
閔妃疋合,厥身是繼。
胡爲嗜不同味,而快朝飽?
啟代益作後,卒然離蠥。
何啟惟憂,而能拘是達?
皆歸射鞠,而無害厥躬。
何後益作革,而禹播降?
啟棘賓商,《九辨》、《九歌》。
何勤子屠母,而死分竟地?
帝降夷羿,革孽夏民。
胡射夫河伯,而妻彼雒嬪?
馮珧利決,封豨是射。
何獻蒸肉之膏,而後帝不若?
浞娶純狐,眩妻爰謀。
何羿之射革,而交吞揆之?
阻窮西征,岩何越焉?
化爲黃熊,巫何活焉?
鹹播秬黍,莆雚是營。
何由並投,而鯀疾修盈?
白蜺嬰茀,胡爲此堂?
安得夫良藥,不能固臧?
天式從横,陽離爰死。
大鳥何鳴,夫焉喪厥體?
蓱號起雨,何以興之?
撰體脅鹿,何以膺之?
鼇戴山拚,何以安之?
釋舟陵行,何之遷之?
惟澆在戶,何求於嫂?
何少康逐犬,而顛隕厥首?
女歧縫裳,而館同爰止。
何顛易厥首,而親以逢殆?
湯謀易旅,何以厚之?
覆舟斟尋,何道取之?
桀伐蒙山,何所得焉?
妺嬉何肆,湯何殛焉?
舜閔在家,父何以鱞?
堯不姚告,二女何親?
厥萌在初,何所意焉?
璜台十成,誰所極焉?
登立爲帝,孰道尚之?
女媧有體,孰制匠之?
舜服厥弟,終然爲害。
何肆犬豕,而厥身不危敗?
吳穫迄古,南嶽是止。
孰期去斯,得兩男子?
緣鵠飾玉,後帝是饗。
何承謀夏桀,終以滅喪?
帝乃降觀,下逢伊摯。
何條放致罰,而黎服大說?
簡狄在台,嚳何宜?玄鳥致貽,女何喜?
該秉季德,厥父是臧。
胡終弊於有扈,牧夫牛羊?
幹協時舞,何以懷之?
平脅曼膚,何以肥之?
有扈牧豎,雲何而逢?
擊床先出,其命何從?
恒秉季德,焉得夫樸牛?
何往營班祿,不但還來?
昏微遵蹟,有狄不寧。
何繁鳥萃棘,負子肆情?
眩弟並淫,危害厥兄。
何變化以作詐,而後嗣逢長?
成湯東巡,有莘爰極。
何乞彼小臣,而吉妃是得?
水濱之木,得彼小子。
夫何惡之,媵有莘之婦?
湯出重泉,夫何罪尤?
不勝心伐帝,夫誰使挑之?
會晁爭盟,何踐吾期?
蒼鳥群飛,孰使萃之?
列擊紂躬,叔旦不嘉。
何親揆發,何周之命以咨嗟?
授殷天下,其位安施?
反成乃亡,其罪伊何?
爭遣伐器,何以行之?
並驅擊翼,何以將之?
昭後成游,南土爰底。
厥利惟何,逢彼白雉?
穆王巧挴,夫何周流?
環理天下,夫何索求?
妖夫曳炫,何號於市?
周幽誰誅?焉得夫褒姒?
天命反側,何罰何佑?
齊桓九會,卒然身殺。
彼王紂之躬,孰使亂惑?
何惡輔弼,讒諂是服?
比幹何逆,而抑沉之?
雷開何順,而賜封之?
何聖人之一德,卒其異方:
梅伯受醢,箕子詳狂?
稷維元子,帝何竺之?
投之於冰上,鳥何燠之?
何馮弓挾矢,殊能將之?
既驚帝切激,何逢長之?
伯昌號衰,秉鞭作牧。
何令徹彼岐社,命有殷國?
遷藏就岐,何能依?
殷有惑婦,何所譏?
受賜茲醢,西伯上告。
何親就上帝罰,殷之命以不救?
師望在肆,昌何識?
鼓刀颺聲,後何喜?
武發殺殷,何所悒?
載屍集戰,何所急?
伯林雉經,維其何故?
何感天抑墜,夫誰畏懼?
皇天集命,惟何戒之?
受禮天下,又使至代之?
初湯臣摯,後茲承輔。
何卒官湯,尊食宗緒?
勳闔、夢生,少離散亡。
何壯武曆,能流厥嚴?
彭鏗斟雉,帝何饗?
受壽永多,夫何久長?
中央共牧,後何怒?
蜂蛾微命,力何固?
驚女采薇,鹿何佑?
北至回水,萃何喜?
兄有噬犬,弟何欲?
易之以百兩,卒無祿?
薄暮雷電,歸何憂?
厥嚴不奉,帝何求?
伏匿穴處,爰何雲?
荊勳作師,夫何長?
悟過改更,我又何言?
吳光爭國,久餘是勝。
何環穿自閭社丘陵,爰出子文?
吾告堵敖以不長。
何試上自予,忠名彌彰?

 

In this world we need not only the spirit of 'asking'; we also need the perseverance to 'pursue':

…… Black Jack at last discovered 咔嗎's 'secret chamber'. After many visits he found a 'hidden door', leading to who knows where. He weighed in his heart whether he should retreat. In the end he sent one 'short message', and from then on vanished from the world ……

1H ㄍㄞˋ 5W ㄉㄨㄟ一˙

1H: 解 (understand)? 5W: 追 (pursue)?

Black Jack's language is a jumble, and this short message is far too brief, so it is hard to reach any 'definitive reading'; still, let us venture an interpretation. '1H 解' presumably means that he has already grasped the 'Know How'; following the same pattern, '5W 追' should point to 'Who?', 'Where?', 'When?', 'What?' and 'Why?'. Since the 'who, where, when and what' have by and large been answered, all that remains to be 'pursued' is '◎', namely Why (為什麼). The symbol ◎ is used here for a reason. In English, 'Why' can also be phrased as 'For What', which is where the Chinese rendering 『為什麼』 comes from; and seeing that Black Jack chose the 追 of 『追問』 ('to pursue a question'), one feels how demanding the 'way of learning' really is. Hence the special use of ◎, to highlight his 'Why not' spirit of pursuing, pursuing and pursuing!!

─── Excerpted from 《黑傑克的咔嗎!!明暗之交》

 

In this spirit, perhaps we can keep pace with Mr. Michael Nielsen?

Why we only applied dropout to the fully-connected layers: If you look carefully at the code above, you’ll notice that we applied dropout only to the fully-connected section of the network, not to the convolutional layers. In principle we could apply a similar procedure to the convolutional layers. But, in fact, there’s no need: the convolutional layers have considerable inbuilt resistance to overfitting. The reason is that the shared weights mean that convolutional filters are forced to learn from across the entire image. This makes them less likely to pick up on local idiosyncrasies in the training data. And so there is less need to apply other regularizers, such as dropout.
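To make the contrast concrete, here is a minimal NumPy sketch of inverted dropout applied to a fully-connected activation vector while convolutional feature maps are left untouched. This is an illustrative stand-in, not the Theano implementation in network3.py; the array shapes and the drop probability of 0.5 are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def dropout_fc(activations, p_dropout=0.5, training=True):
    # Inverted dropout: during training each unit is zeroed with probability
    # p_dropout and the survivors are rescaled by 1/(1 - p_dropout), so no
    # extra scaling is needed at test time.
    if not training or p_dropout == 0.0:
        return activations
    mask = rng.binomial(1, 1.0 - p_dropout, size=activations.shape)
    return activations * mask / (1.0 - p_dropout)

# Fully-connected activations: dropout is applied here ...
fc_out = rng.standard_normal(100)              # e.g. a 100-unit hidden layer
fc_out_train = dropout_fc(fc_out, p_dropout=0.5)

# ... but convolutional feature maps are passed through unchanged; the
# shared weights already resist overfitting, so no dropout is used.
conv_maps = rng.standard_normal((20, 12, 12))  # 20 feature maps, 12x12 each
conv_out_train = conv_maps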

Going further: It’s possible to improve performance on MNIST still further. Rodrigo Benenson has compiled an informative summary page, showing progress over the years, with links to papers. Many of these papers use deep convolutional networks along lines similar to the networks we’ve been using. If you dig through the papers you’ll find many interesting techniques, and you may enjoy implementing some of them. If you do so it’s wise to start implementation with a simple network that can be trained quickly, which will help you more rapidly understand what is going on.

For the most part, I won’t try to survey this recent work. But I can’t resist making one exception. It’s a 2010 paper by Cireșan, Meier, Gambardella, and Schmidhuber*

*Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010).

What I like about this paper is how simple it is. The network is a many-layer neural network, using only fully-connected layers (no convolutions). Their most successful network had hidden layers containing 2,500, 2,000, 1,500, 1,000, and 500 neurons, respectively. They used ideas similar to Simard et al. to expand their training data. But apart from that, they used few other tricks, including no convolutional layers: it was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power(!) They achieved a classification accuracy of 99.65 percent, more or less the same as ours. The key was to use a very large, very deep network, and to use a GPU to speed up training. This let them train for many epochs. They also took advantage of their long training times to gradually decrease the learning rate from 10^{-3} to 10^{-6}. It’s a fun exercise to try to match these results using an architecture like theirs.
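As a rough sketch of that kind of learning-rate schedule (the epoch count below is a placeholder, not the figure used in the paper), a geometric decay from 10^{-3} down to 10^{-6} can be generated like this:

import numpy as np

eta_start, eta_end = 1e-3, 1e-6
num_epochs = 800            # placeholder epoch count, not the paper's figure

# Multiply eta by a fixed factor each epoch so that it falls geometrically
# from eta_start to eta_end over the run.
decay = (eta_end / eta_start) ** (1.0 / (num_epochs - 1))
etas = eta_start * decay ** np.arange(num_epochs)

print(etas[0], etas[-1])    # 0.001 ... 1e-06 (up to floating-point error)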

Why are we able to train? We saw in the last chapter that there are fundamental obstructions to training in deep, many-layer neural networks. In particular, we saw that the gradient tends to be quite unstable: as we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem). Since the gradient is the signal we use to train, this causes problems.
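As a quick reminder of where that instability comes from, the sketch below (assuming sigmoid units and standard N(0, 1) Gaussian weights, as in the earlier chapters) multiplies together the per-layer factors |w · σ′(z)| that the backpropagated gradient picks up; since σ′(z) ≤ 1/4, the product usually shrinks rapidly with depth, which is the vanishing-gradient effect.

import numpy as np

rng = np.random.default_rng(1)

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)        # never larger than 0.25 (at z = 0)

# Along a chain of sigmoid layers the backpropagated gradient picks up a
# factor of roughly w * sigmoid'(z) per layer.  With N(0, 1) weights these
# factors are usually well below 1, so the product shrinks with depth.
num_layers = 10
ws = rng.standard_normal(num_layers)    # one illustrative weight per layer
zs = rng.standard_normal(num_layers)    # illustrative weighted inputs

factor = np.prod(np.abs(ws * sigmoid_prime(zs)))
print("gradient scale after %d layers: %.2e" % (num_layers, factor))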

How have we avoided those results?

Of course, the answer is that we haven’t avoided these results. Instead, we’ve done a few things that help us proceed anyway. In particular: (1) Using convolutional layers greatly reduces the number of parameters in those layers, making the learning problem much easier; (2) Using more powerful regularization techniques (notably dropout and convolutional layers) to reduce overfitting, which is otherwise more of a problem in more complex networks; (3) Using rectified linear units instead of sigmoid neurons, to speed up training – empirically, often by a factor of 3 to 5; (4) Using GPUs and being willing to train for a long period of time. In particular, in our final experiments we trained for 40 epochs using a data set 5 times larger than the raw MNIST training data. Earlier in the book we mostly trained for 30 epochs using just the raw training data. Combining factors (3) and (4) it’s as though we’ve trained a factor of perhaps 30 times longer than before.
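Spelling out that last bit of arithmetic (the 3-to-5 ReLU speed-up is the empirical range quoted above; taking its midpoint as roughly 4.5 is my own rounding, purely for illustration):

# Rough bookkeeping behind the "perhaps 30 times longer" estimate.
epochs_now, data_factor = 40, 5     # final experiments: 40 epochs, 5x data
epochs_before = 30                  # earlier chapters: 30 epochs, raw data
relu_speedup = 4.5                  # assumed midpoint of the 3-5x range

effective_factor = (epochs_now * data_factor / epochs_before) * relu_speedup
print(round(effective_factor))      # => 30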

Your response may be “Is that it? Is that all we had to do to train deep networks? What’s all the fuss about?”

Of course, we’ve used other ideas, too: making use of sufficiently large data sets (to help avoid overfitting); using the right cost function (to avoid a learning slowdown); using good weight initializations (also to avoid a learning slowdown, due to neuron saturation); algorithmically expanding the training data. We discussed these and other ideas in earlier chapters, and have for the most part been able to reuse these ideas with little comment in this chapter.
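For example, the "good weight initialization" referred to above is Chapter 3's prescription of Gaussian weights with standard deviation 1/√(n_in). A minimal sketch, with illustrative layer sizes, looks like this:

import numpy as np

rng = np.random.default_rng(2)

def init_layer(n_in, n_out):
    # Chapter-3-style initialization: weights drawn from a Gaussian with
    # standard deviation 1/sqrt(n_in), biases from a standard Gaussian.
    # Keeping the weighted input z = w.a + b modest in size helps avoid
    # saturated sigmoid neurons, and hence a learning slowdown.
    w = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = rng.standard_normal(n_out)
    return w, b

# Illustrative layer sizes only (784 inputs, 100 hidden neurons).
w, b = init_layer(n_in=784, n_out=100)
print(w.std())   # roughly 1 / sqrt(784), i.e. about 0.036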

With that said, this really is a rather simple set of ideas. Simple, but powerful, when used in concert. Getting started with deep learning has turned out to be pretty easy!

How deep are these networks, anyway? Counting the convolutional-pooling layers as single layers, our final architecture has 4 hidden layers. Does such a network really deserve to be called a deep network? Of course, 4 hidden layers is many more than in the shallow networks we studied earlier. Most of those networks only had a single hidden layer, or occasionally 2 hidden layers. On the other hand, as of 2015 state-of-the-art deep networks sometimes have dozens of hidden layers. I’ve occasionally heard people adopt a deeper-than-thou attitude, holding that if you’re not keeping-up-with-the-Joneses in terms of number of hidden layers, then you’re not really doing deep learning. I’m not sympathetic to this attitude, in part because it makes the definition of deep learning into something which depends upon the result-of-the-moment. The real breakthrough in deep learning was to realize that it’s practical to go beyond the shallow 1- and 2-hidden-layer networks that dominated work until the mid-2000s. That really was a significant breakthrough, opening up the exploration of much more expressive models. But beyond that, the number of layers is not of primary fundamental interest. Rather, the use of deeper networks is a tool to use to help achieve other goals – like better classification accuracies.

A word on procedure: In this section, we’ve smoothly moved from single hidden-layer shallow networks to many-layer convolutional networks. It’s all seemed so easy! We make a change and, for the most part, we get an improvement. If you start experimenting, I can guarantee things won’t always be so smooth. The reason is that I’ve presented a cleaned-up narrative, omitting many experiments – including many failed experiments. This cleaned-up narrative will hopefully help you get clear on the basic ideas. But it also runs the risk of conveying an incomplete impression. Getting a good, working network can involve a lot of trial and error, and occasional frustration. In practice, you should expect to engage in quite a bit of experimentation. To speed that process up you may find it helpful to revisit Chapter 3’s discussion of how to choose a neural network’s hyper-parameters, and perhaps also to look at some of the further reading suggested in that section.

The code for our convolutional networks

Alright, let’s take a look at the code for our program, network3.py. Structurally, it’s similar to network2.py, the program we developed in Chapter 3, although the details differ, due to the use of Theano. We’ll start by looking at the FullyConnectedLayer class, which is similar to the layers studied earlier in the book. Here’s the code (discussion below):

………