W!o+'s 《小伶鼬工坊演義》: Neural Networks [Deep Learning] IV, Part 1

According to 'historical legend', Master Huiyuan of the Eastern Jin dynasty, who presided over Donglin Temple, set himself the rule that 'my shadow shall not leave the mountain, my footprints shall not enter the valley'; whenever he crossed Tiger Creek, the tiger on the hill behind the temple would roar. One day the great poet Tao Yuanming and the Daoist Lu Xiujing came to visit, and the conversation was so congenial that, while seeing them off, he crossed the Tiger Creek bridge without noticing; only on hearing the tiger's roar did the three realize it, and they looked at one another, laughed heartily, and parted. Later generations call the story 'the Three Laughers at Tiger Creek'. The Qing-dynasty writer Tang Woji (唐蝸寄) later composed this famous couplet for the Three Laughs Pavilion of Donglin Temple on Mount Lu:

A bridge spans Tiger Creek: three teachings, three streams of thought, three friends, three bursts of laughter;
Lotus blooms by the monks' quarters: one flower, one world; one leaf, one Tathagata.

Readers today are perhaps more familiar with the English poet William Blake's famous lines, 'To see a world in a grain of sand, and a heaven in a wild flower.' They open his long poem Auguries of Innocence:

[Image: sandworld]

[Image: bird-of-paradise flower]

[Image: William Blake, Jacob's Ladder]

Auguries of Innocence

To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.

Blake was born in 1757. Headstrong even as a child, he loathed the dogmatic air of orthodox schooling, refused to enroll, and made himself learned through wide reading; his close study of the empiricist philosophy of Locke and Burke gave him an early, deep understanding of the wider world. To lighten the family's finances and look after his younger siblings' prospects, he gave up his dream of becoming a painter: at fourteen he apprenticed himself to an engraving workshop, completing his apprenticeship at twenty-two, …
and he is counted the first of the English Romantic poets.
Jacob's Ladder, Blake's engraving. Is Blake himself 'climbing' it?

Burke's famous A Philosophical Enquiry into the Origin of Our Ideas of the Sublime and Beautiful was what Blake drew on to observe 'the postures of birds in flight' ── Auguries ──, to experience omens and take part in art; how fitting! It is like the palm of a hand that can hold the infinite, or the 'cycling' of the clock hand used to measure out eternity; perhaps Blake's romanticism was steeped in rational reflection, and its essence always lay in observation.

At that time, on the far shore of the ocean, it was the age of the 'frontier': Johnny Chapman, born in 1774, spun out the legend of 'Appleseed'.

[Image: Johnny Appleseed]

What, then, can one see in an 'apple seed'? The tenacity of life? The joy of breaking through the soil? Or the one Eve stole a bite of? Perhaps it could be put this way: mastery of 『 』 changed how people lived in that age, while a single seed, handed down, shapes what generation after generation comes to hold. And the 'pioneering spirit' that Johnny Appleseed stands for ──

STAR TREK
Where No One Has Gone Before

──, will it in the end become a 'seed of concepts', waiting for its moment to 'sprout'?

On July 4, 1776, the American Continental Congress adopted the Declaration of Independence, proclaiming the new nation independent and wholly separate from Britain, for the sake of 'life, liberty and the pursuit of happiness' and the realization of the ideals of the Enlightenment. One hundred and thirty-nine years later, on March 11, 1915, Joseph Carl Robnett Licklider was born in St. Louis, Missouri; one wonders how far that is from St. Petersburg, Missouri, the setting of Mark Twain's The Adventures of Tom Sawyer. The only son of a Baptist minister, he loved model airplanes as a boy, showed a gift for engineering, and kept tinkering with cars all his life; history calls him the 'computer seed'.

─── Excerpted from 《一個奇想!!》

 

One of Lick's memos:

In 1963, having moved to ARPA to head its Behavioral Sciences, Command & Control Research office, he wrote to his co-workers in a memo titled 'Members and Affiliates of the Intergalactic Computer Network':
“imagined as an electronic commons open to all, ‘the main and essential medium of informational interaction for governments, institutions, corporations, and individuals.’”

 

and from it the Internet was born. The springtime of neural networks has been arriving, step by step, with the sowing and tilling of its 'practitioners'. Yet without the 'right data', how can one 'train rightly'? Must one therefore gather 'big data' from the cloud?? Even with 'big data', though, lacking good 'methods' and suitable 'structures', it is mere brute force! Could that truly yield insight!! From this we see what 'practice' is really about: a field for 'drilling one's technique'. Even though the Theano library cannot be used on the Raspberry Pi, that is no obstacle to reading Mr. Michael Nielsen's text:

Convolutional neural networks in practice

We’ve now seen the core ideas behind convolutional neural networks. Let’s look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we’ll use to do this is called network3.py, and it’s an improved version of the programs network.py and network2.py developed in earlier chapters*.

*Note also that network3.py incorporates ideas from the Theano library’s documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil’s implementation of dropout, and from Chris Olah.

If you wish to follow along, the code is available on GitHub. Note that we’ll work through the code for network3.py itself in the next section. In this section, we’ll use network3.py as a library to build convolutional networks.

The programs network.py and network2.py were implemented using Python and the matrix library Numpy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. But now that we understand those details, for network3.py we’re going to use a machine learning library known as Theano*.

*See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). Theano is also the basis for the popular Pylearn2 and Keras neural networks libraries. Other popular neural nets libraries at the time of this writing include Caffe and Torch.

Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
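
To make "it automatically computes all the mappings involved" concrete, here is a minimal, hedged sketch of Theano's symbolic differentiation (a toy example, not taken from network3.py): we declare a parameter and a cost, and T.grad derives the gradient expression for us, which is exactly what spares us from hand-coding backpropagation.

import theano
import theano.tensor as T

# A toy parameter and a toy quadratic cost; Theano builds the symbolic graph.
w = theano.shared(0.0, name="w")
x = T.dscalar("x")
cost = (w * x - 1.0) ** 2

# The gradient is derived symbolically -- no hand-written backpropagation.
grad_w = T.grad(cost, w)
step = theano.function([x], cost, updates=[(w, w - 0.1 * grad_w)])

for _ in range(50):   # a few steps of gradient descent on the toy cost
    step(2.0)
print(w.get_value())  # approaches 0.5, since 0.5 * 2.0 = 1.0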

If you wish to follow along, then you’ll need to get Theano running on your system. To install Theano, follow the instructions at the project’s homepage. The examples which follow were run using Theano 0.6*.

*As I release this chapter, the current version of Theano has changed to version 0.7. I’ve actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.

Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you’ll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don’t have a GPU available locally, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you’re using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.
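
If you want to control the CPU/GPU choice yourself, one common approach with the old Theano 0.6/0.7 backend is to set THEANO_FLAGS before Theano is first imported. The snippet below is a hedged sketch of that approach (it is separate from the GPU flag inside the network3.py source, and the exact flag values depend on your installation):

import os
# Choose the device before the first `import theano`; use "device=cpu" if no GPU is available.
os.environ.setdefault("THEANO_FLAGS", "device=gpu,floatX=float32")

import theano
# Confirm what Theano actually picked up.
print(theano.config.device, theano.config.floatX)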

To get a baseline, we’ll start with a shallow architecture using just a single hidden layer, containing 100 hidden neurons. We’ll train for 60 epochs, using a learning rate of \eta = 0.1, a mini-batch size of 10, and no regularization. Here we go*:

*Code for the experiments in this section may be found in this script. Note that the code in the script simply duplicates and parallels the discussion in this section.

Note also that throughout the section I’ve explicitly specified the number of training epochs. I’ve done this for clarity about how we’re training. In practice, it’s worth using early stopping, that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.

>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([
        FullyConnectedLayer(n_in=784, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)

I obtained a best classification accuracy of 97.80 percent. This is the classification accuracy on the test_data, evaluated at the training epoch where we get the best classification accuracy on the validation_data. Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data (see this earlier discussion of the use of validation data). We will follow this practice below. Your results may vary slightly, since the network’s weights and biases are randomly initialized*.

*In fact, in this experiment I actually did three separate runs training a network with this architecture. I then reported the test accuracy which corresponded to the best validation accuracy from any of the three runs. Using multiple runs helps reduce variation in results, which is useful when comparing many architectures, as we are doing. I’ve followed this procedure below, except where noted. In practice, it made little difference to the results obtained.
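
The bookkeeping described above, together with the early stopping mentioned in the earlier footnote, amounts to something like the following hedged sketch. The per-epoch accuracy lists are hypothetical illustrations, not output of network3.py:

# Hypothetical per-epoch accuracies, for illustration only.
validation_accuracies = [0.950, 0.962, 0.971, 0.970, 0.969]
test_accuracies       = [0.948, 0.960, 0.972, 0.971, 0.969]

# Report the test accuracy at the epoch with the best validation accuracy.
best_epoch = max(range(len(validation_accuracies)),
                 key=lambda epoch: validation_accuracies[epoch])
print("reported test accuracy:", test_accuracies[best_epoch])

# Early stopping: give up once validation accuracy has not improved for `patience` epochs.
patience = 10
if len(validation_accuracies) - 1 - best_epoch >= patience:
    print("validation accuracy has stopped improving; stop training")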

This 97.80 percent accuracy is close to the 98.04 percent accuracy obtained back in Chapter 3, using a similar network architecture and learning hyper-parameters. In particular, both examples used a shallow network, with a single hidden layer containing 100 hidden neurons. Both also trained for 60 epochs, used a mini-batch size of 10, and a learning rate of \eta = 0.1.

There were, however, two differences in the earlier network. First, we regularized the earlier network, to help reduce the effects of overfitting. Regularizing the current network does improve the accuracies, but the gain is only small, and so we’ll hold off worrying about regularization until later. Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function. As explained in Chapter 3 this isn’t a big change. I haven’t made this switch for any particularly deep reason – mostly, I’ve done it because softmax plus log-likelihood cost is more common in modern image classification networks.
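
For reference, and assuming the notation of Chapter 3, the softmax output activations and the log-likelihood cost referred to above take the form

a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \qquad C = -\ln a^L_y,

where z^L_j are the weighted inputs to the final layer and y is the correct label for the training input.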

Can we do better than these results using a deeper network architecture?

Let’s begin by inserting a convolutional layer, right at the beginning of the network. We’ll use 5 by 5 local receptive fields, a stride length of 1, and 20 feature maps. We’ll also insert a max-pooling layer, which combines the features using 2 by 2 pooling windows. So the overall network architecture looks much like the architecture discussed in the last section, but with an extra fully-connected layer.

In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image, while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image. This is a common pattern in convolutional neural networks.
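
As a quick sanity check on the layer sizes used in the code below (a hedged sketch, assuming the 'valid' convolution with stride 1 that the text describes): a 5 \times 5 local receptive field over a 28 \times 28 image leaves 24 \times 24 positions, and 2 \times 2 max-pooling halves that to 12 \times 12, which is where the n_in=20*12*12 in the next code listing comes from.

image_size, filter_size, pool_size, n_feature_maps = 28, 5, 2, 20

conv_out = image_size - filter_size + 1      # 24: side length after the convolutional layer
pool_out = conv_out // pool_size             # 12: side length after 2x2 max-pooling
print(n_feature_maps * pool_out * pool_out)  # 2880 = 20*12*12, the fully-connected layer's n_in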

Let’s train such a network, and see how it performs*:

*I’ve continued to use a mini-batch size of 10 here. In fact, as we discussed earlier it may be possible to speed up training using larger mini-batches. I’ve continued to use the same mini-batch size mostly for consistency with the experiments in earlier chapters.

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=20*12*12, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)   

That gets us to 98.78 percent accuracy, which is a considerable improvement over any of our previous results. Indeed, we’ve reduced our error rate by better than a third, which is a great improvement.

In specifying the network structure, I’ve treated the convolutional and pooling layers as a single layer. Whether they’re regarded as separate layers or as a single layer is to some extent a matter of taste. network3.py treats them as a single layer because it makes the code for network3.py a little more compact. However, it is easy to modify network3.py so the layers can be specified separately, if desired.

Can we improve on the 98.78 percent classification accuracy?

Let’s try inserting a second convolutional-pooling layer. We’ll make the insertion between the existing convolutional-pooling layer and the fully-connected hidden layer. Again, we’ll use a 5 \times 5 local receptive field, and pool over 2 \times 2 regions. Let’s see what happens when we train using similar hyper-parameters to before:

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2)),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), 
                      filter_shape=(40, 20, 5, 5), 
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=40*4*4, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)

Once again, we get an improvement: we’re now at 99.06 percent classification accuracy!

There are two natural questions to ask at this point. The first question is: what does it even mean to apply a second convolutional-pooling layer? In fact, you can think of the second convolutional-pooling layer as having as input 12 \times 12 “images”, whose “pixels” represent the presence (or absence) of particular localized features in the original input image. So you can think of this layer as having as input a version of the original input image. That version is abstracted and condensed, but still has a lot of spatial structure, and so it makes sense to use a second convolutional-pooling layer.

That’s a satisfying point of view, but gives rise to a second question. The output from the previous layer involves 20 separate feature maps, and so there are 20 \times 12 \times 12 inputs to the second convolutional-pooling layer. It’s as though we’ve got 20 separate images input to the convolutional-pooling layer, not a single image, as was the case for the first convolutional-pooling layer. How should neurons in the second convolutional-pooling layer respond to these multiple input images? In fact, we’ll allow each neuron in this layer to learn from all 20 \times 5 \times 5 input neurons in its local receptive field. More informally: the feature detectors in the second convolutional-pooling layer have access to all the features from the previous layer, but only within their particular local receptive field*.

*This issue would have arisen in the first layer if the input images were in color. In that case we’d have 3 input features for each pixel, corresponding to red, green and blue channels in the input image. So we’d allow the feature detectors to have access to all color information, but only within a given local receptive field.
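
Tracing the spatial dimensions through both convolutional-pooling layers makes the 40*4*4 in the code above concrete. The sketch below assumes, as before, 'valid' convolutions with stride 1 and 2 \times 2 pooling:

def conv_pool_output(size, filter_size=5, pool_size=2):
    """Side length produced by one ConvPoolLayer: valid convolution, stride 1, then pooling."""
    return (size - filter_size + 1) // pool_size

first = conv_pool_output(28)       # 28 -> 24 -> 12 (first convolutional-pooling layer)
second = conv_pool_output(first)   # 12 -> 8 -> 4 (second convolutional-pooling layer)
print(first, second)               # 12 4
print(40 * second * second)        # 640 = 40*4*4, the n_in of the final FullyConnectedLayer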

………