W!o+'s 《小伶鼬工坊演義》: Neural Networks [Deep Learning] IV (Middle)

《題李凝幽居》 (Inscribed at Li Ning's Secluded Dwelling), Tang dynasty, Jia Dao

Living in seclusion, with few neighbours nearby; a grassy path leads into an overgrown garden.
Birds roost in the trees by the pond; a monk knocks at the gate beneath the moon.
Crossing the bridge divides the colours of the fields; shifting a stone stirs the roots of the clouds.
I leave for a while but will return; I shall not break my promise of this secluded retreat.

In the sixth year of the Yuanhe era (811), Jia Dao called on Han Yu, who came to admire him deeply for his poetry. Jia Dao is famous as a poet of the "painstaking composition" (kuyin) school, and the well-known allusion "tuiqiao" (推敲, weighing one's words) originates with him. According to legend, while riding a donkey in Chang'an he was reciting the line "鳥宿池邊樹,僧敲月下門" ("Birds roost in the trees by the pond; a monk knocks at the gate beneath the moon") and could not settle between 推 ("push") and 敲 ("knock"); later generations therefore came to call the careful weighing of wording "tuiqiao". During his time in Han Yu's circle he exchanged poems closely with Zhang Ji, Meng Jiao, Ma Dai, and Yao He. He excelled at five-character regulated verse, and painstaking composition became a habit with him. His diction is strikingly unusual and leaves a deep impression; he often depicts desolate, cold, and forlorn scenes and expresses sorrowful, solitary feelings, as in the lines "Walking alone, my reflection at the bottom of the pool; pausing to rest, my body beside the trees" and "The returning clerk seals up the night keys; a moving snake slips into the old tong tree." Such painstakingly wrought lines make up his strange, austere, and chilly style, which leaves an impression of withered stillness and gloom. Yet he also wrote poems that reveal a pure beauty within solitude, and others whose language is plain and natural, whose feeling is sincere and direct, and whose style is bold and vigorous.

Legend has it that while Jia Dao was absorbed in weighing 推 ("push") against 敲 ("knock"), he unknowingly rode straight into Han Yu's procession, and so received instruction in a single character ─── a "one-character teacher". Han Yu reasoned that since "birds roost in the trees by the pond" implies nightfall, to "push" the gate open directly might seem presumptuous; better to "knock" first.

Weighing words, refining phrases, experimenting with combinations and arrangements of imagery: such was Jia Dao's method of "painstaking composition". It is, in essence, one and the same as the "foremost method under heaven" ─── trial and error.

Trial and error

Trial and error is a fundamental method of solving problems.[1] It is characterised by repeated, varied attempts which are continued until success,[2] or until the agent stops trying.

According to W.H. Thorpe, the term was devised by C. Lloyd Morgan after trying out similar phrases “trial and failure” and “trial and practice”.[3] Under Morgan’s Canon, animal behaviour should be explained in the simplest possible way. Where behaviour seems to imply higher mental processes, it might be explained by trial-and-error learning. An example is the skillful way in which his terrier Tony opened the garden gate, easily misunderstood as an insightful act by someone seeing the final behaviour. Lloyd Morgan, however, had watched and recorded the series of approximations by which the dog had gradually learned the response, and could demonstrate that no insight was required to explain it.

Edward Thorndike showed how to manage a trial-and-error experiment in the laboratory. In his famous experiment, a cat was placed in a series of puzzle boxes in order to study the law of effect in learning.[4] He plotted learning curves which recorded the timing for each trial. Thorndike’s key observation was that learning was promoted by positive results, which was later refined and extended by B.F. Skinner‘s operant conditioning.

Trial and error is also a heuristic method of problem solving, repair, tuning, or obtaining knowledge. In the field of computer science, the method is called generate and test. In elementary algebra, when solving equations, it is “guess and check”.
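To make "generate and test" concrete, here is a minimal sketch in Python (purely illustrative, not tied to any particular library): candidate answers are generated one after another and each is tested until one passes.

# Generate and test ("guess and check"): propose candidates in turn and
# keep the first one that satisfies the condition being tested.

def guess_and_check(target_square, candidates):
    """Return the first candidate x with x*x == target_square, or None."""
    for x in candidates:            # generate a candidate
        if x * x == target_square:  # test it
            return x
    return None                     # every attempt failed

print(guess_and_check(1369, range(100)))  # -> 37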

This approach can be seen as one of the two basic approaches to problem solving, contrasted with an approach using insight and theory. However, there are intermediate methods which, for example, use theory to guide the method, an approach known as guided empiricism.

[Figure: a computer graph created by trial and error ("Trial with PC")]

… the gateway. Let us now hear Michael Nielsen expound the spirit of "trial and error":

Using rectified linear units: The network we’ve developed at this point is actually a variant of one of the networks used in the seminal 1998 paper*

*“Gradient-based learning applied to document recognition”, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998). There are many differences of detail, but broadly speaking our network is quite similar to the networks described in the paper.

introducing the MNIST problem, a network known as LeNet-5. It’s a good foundation for further experimentation, and for building up understanding and intuition. In particular, there are many ways we can vary the network in an attempt to improve our results.

As a beginning, let’s change our neurons so that instead of using a sigmoid activation function, we use rectified linear units. That is, we’ll use the activation function f(z) \equiv \max(0, z). We’ll train for 60 epochs, with a learning rate of \eta = 0.03. I also found that it helps a little to use some l2 regularization, with regularization parameter \lambda = 0.1:

>>> from network3 import ReLU
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), 
                      filter_shape=(40, 20, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.03, 
            validation_data, test_data, lmbda=0.1)

I obtained a classification accuracy of 99.23 percent. It’s a modest improvement over the sigmoid results (99.06). However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions. There appears to be a real gain in moving to rectified linear units for this problem.

What makes the rectified linear activation function better than the sigmoid or tanh functions? At present, we have a poor understanding of the answer to this question. Indeed, rectified linear units have only begun to be widely used in the past few years. The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments*

*A common justification is that max(0, z) doesn't saturate in the limit of large z, unlike sigmoid neurons, and this helps rectified linear units continue learning. The argument is fine, as far as it goes, but it's hardly a detailed justification, more of a just-so story. Note that we discussed the problems with saturation back in Chapter 2.

They got good results classifying benchmark data sets, and the practice has spread. In an ideal world we’d have a theory telling us which activation function to pick for which application. But at present we’re a long way from such a world. I should not be at all surprised if further major improvements can be obtained by an even better choice of activation function. And I also expect that in coming decades a powerful theory of activation functions will be developed. Today, we still have to rely on poorly understood rules of thumb and experience.
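The saturation argument in the note above is easy to check numerically. Here is a small sketch (plain Python, for illustration only) comparing the gradient of a sigmoid neuron with that of a rectified linear unit as the weighted input z grows:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    # derivative of max(0, z): 1 for z > 0, otherwise 0
    return 1.0 if z > 0 else 0.0

for z in [1.0, 5.0, 10.0, 20.0]:
    print(z, sigmoid_prime(z), relu_prime(z))

# The sigmoid gradient shrinks towards zero as z grows (saturation),
# while the rectified linear unit keeps a gradient of 1 for all positive z.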

Expanding the training data: Another way we may hope to improve our results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel. We can do this by running the program expand_mnist.py from the shell prompt*

*The code for expand_mnist.py is available here:

python expand_mnist.py

Running this program takes the 50,000 MNIST training images, and prepares an expanded training set, with 250,000 training images. We can then use those training images to train our network. We'll use the same network as above, with rectified linear units. In my initial experiments I reduced the number of training epochs - this made sense, since we're training with 5 times as much data. But, in fact, expanding the data turned out to considerably reduce the effect of overfitting. And so, after some experimentation, I eventually went back to training for 60 epochs. In any case, let's train:

>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)

Using the expanded training data I obtained a classification accuracy of 99.37 percent. So this almost trivial change gives a substantial improvement in classification accuracy. Indeed, as we discussed earlier, this idea of algorithmically expanding the data can be taken further.
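For intuition about what the expansion involves, here is a minimal sketch of a one-pixel displacement (written with numpy purely for illustration; it is not the actual expand_mnist.py):

import numpy as np

def shift_image(image, d_row, d_col):
    """Shift a 28x28 image by (d_row, d_col) pixels, padding with zeros."""
    shifted = np.zeros_like(image)
    rows, cols = image.shape
    # slices selecting the overlapping region in source and destination
    src_r = slice(max(0, -d_row), rows - max(0, d_row))
    dst_r = slice(max(0, d_row), rows - max(0, -d_row))
    src_c = slice(max(0, -d_col), cols - max(0, d_col))
    dst_c = slice(max(0, d_col), cols - max(0, -d_col))
    shifted[dst_r, dst_c] = image[src_r, src_c]
    return shifted

image = np.random.rand(28, 28)  # stand-in for a single MNIST image
expanded = [image] + [shift_image(image, dr, dc)
                      for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
# One original plus four one-pixel displacements: five images in all,
# which is how 50,000 training images become 250,000.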
Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt*

*"Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis", by Patrice Simard, Dave Steinkraus, and John Platt (2003). http://dx.doi.org/10.1109/ICDAR.2003.1227801

improved their MNIST performance to 99.6 percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with 100 neurons. There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data. They did this by rotating, translating, and skewing the MNIST training images. They also developed a process of "elastic distortion", a way of emulating the random oscillations hand muscles undergo when a person is writing. By combining all these processes they substantially increased the effective size of their training data, and that's how they achieved 99.6 percent accuracy.

Inserting an extra fully-connected layer: Can we do even better? One possibility is to use exactly the same procedure as above, but to expand the size of the fully-connected layer. I tried with 300 and 1,000 neurons, obtaining results of 99.46 and 99.43 percent, respectively. That's interesting, but not really a convincing win over the earlier result (99.37 percent).

What about adding an extra fully-connected layer? Let's try inserting an extra fully-connected layer, so that we have two 100-hidden neuron fully-connected layers:

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        FullyConnectedLayer(n_in=100, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)

Doing this, I obtained a test accuracy of 99.43 percent. Again, the expanded net isn't helping so much. Running similar experiments with fully-connected layers containing 300 and 1,000 neurons yields results of 99.48 and 99.47 percent. That's encouraging, but still falls short of a really decisive win.

What's going on here? Is it that the expanded or extra fully-connected layers really don't help with MNIST? Or might it be that our network has the capacity to do better, but we're going about learning the wrong way? For instance, maybe we could use stronger regularization techniques to reduce the tendency to overfit. One possibility is the dropout technique introduced back in Chapter 3. Recall that the basic idea of dropout is to remove individual activations at random while training the network. This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data.
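As a quick reminder of the mechanics, here is a tiny illustrative sketch of removing activations at random (network3.py handles this internally through the p_dropout parameter; this standalone numpy version is only for intuition):

import numpy as np

def dropout(activations, p_dropout=0.5):
    """Zero each activation with probability p_dropout (training time only)."""
    mask = np.random.binomial(1, 1.0 - p_dropout, size=activations.shape)
    return activations * mask

layer_output = np.random.rand(10)  # stand-in for one layer's activations
print(dropout(layer_output))       # roughly half the entries are now zero
# At test time no activations are dropped; the full network is used.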
Let's try applying dropout to the final fully-connected layers:

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(
            n_in=40*4*4, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        FullyConnectedLayer(
            n_in=1000, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        SoftmaxLayer(n_in=1000, n_out=10, p_dropout=0.5)],
        mini_batch_size)
>>> net.SGD(expanded_training_data, 40, mini_batch_size, 0.03,
            validation_data, test_data)

Using this, we obtain an accuracy of 99.60 percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with 100 hidden neurons, where we achieved 99.37 percent.

There are two changes worth noting. First, I reduced the number of training epochs to 40: dropout reduced overfitting, and so we learned faster. Second, the fully-connected hidden layers have 1,000 neurons, not the 100 used earlier. Of course, dropout effectively omits many of the neurons while training, so some expansion is to be expected. In fact, I tried experiments with both 300 and 1,000 hidden neurons, and obtained (very slightly) better validation performance with 1,000 hidden neurons.

Using an ensemble of networks: An easy way to improve performance still further is to create several neural networks, and then get them to vote to determine the best classification. Suppose, for example, that we trained 5 different neural networks using the prescription above, with each achieving accuracies near to 99.6 percent. Even though the networks would all have similar accuracies, they might well make different errors, due to the different random initializations. It's plausible that taking a vote amongst our 5 networks might yield a classification better than any individual network.

This sounds too good to be true, but this kind of ensembling is a common trick with both neural networks and other machine learning techniques. And it does in fact yield further improvements: we end up with 99.67 percent accuracy.
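The voting step itself is simple. Here is a minimal sketch, assuming we already have each trained network's predicted digit for every test image (the names and numbers below are made up for illustration):

from collections import Counter

def ensemble_vote(per_network_predictions):
    """Majority vote across networks: one list of predicted digits per network."""
    votes_per_image = zip(*per_network_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_image]

# Three hypothetical networks, four test images:
predictions = [[7, 2, 1, 0],
               [7, 2, 1, 9],
               [7, 3, 1, 0]]
print(ensemble_vote(predictions))  # -> [7, 2, 1, 0]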
In other words, our ensemble of networks classifies all but 33 of the 10,000 test images correctly.

The remaining errors in the test set are shown below. The label in the top right is the correct classification, according to the MNIST data, while in the bottom right is the label output by our ensemble of nets:

[Image: http://neuralnetworksanddeeplearning.com/images/ensemble_errors.png]

It's worth looking through these in detail. The first two digits, a 6 and a 5, are genuine errors by our ensemble. However, they're also understandable errors, the kind a human could plausibly make. That 6 really does look a lot like a 0, and the 5 looks a lot like a 3. The third image, supposedly an 8, actually looks to me more like a 9. So I'm siding with the network ensemble here: I think it's done a better job than whoever originally drew the digit. On the other hand, the fourth image, the 6, really does seem to be classified badly by our networks.

And so on. In most cases our networks' choices seem at least plausible, and in some cases they've done a better job classifying than the original person did writing the digit. Overall, our networks offer exceptional performance, especially when you consider that they correctly classified 9,967 images which aren't shown. In that context, the few clear errors here seem quite understandable. Even a careful human makes the occasional mistake. And so I expect that only an extremely careful and methodical human would do much better. Our network is getting near to human performance.

───