W!o+'s 《小伶鼬工坊演義》: Neural Networks [Deep Learning] IV (Middle)

《題李凝幽居》 (Inscribed at Li Ning's Secluded Dwelling), Tang dynasty, Jia Dao

Living in seclusion, with few neighbours nearby; a grassy path leads into an overgrown garden.
Birds roost in the trees by the pond; a monk knocks at the gate beneath the moon.
Crossing the bridge divides the colours of the fields; shifting a stone stirs the roots of the clouds.
I leave for a while but will return; I shall not break my promise of this secluded retreat.

In the sixth year of the Yuanhe era (811), Jia Dao called on Han Yu, who came to admire him deeply for his poetry. Jia Dao is famous as a poet of the "painstaking composition" (kuyin) school, and the well-known allusion "tuiqiao" (推敲, weighing one's words) originates with him. According to legend, while riding a donkey in Chang'an he was reciting the line "鳥宿池邊樹,僧敲月下門" ("Birds roost in the trees by the pond; a monk knocks at the gate beneath the moon") and could not settle between 推 ("push") and 敲 ("knock"); later generations therefore came to call the careful weighing of wording "tuiqiao". During his time in Han Yu's circle he exchanged poems closely with Zhang Ji, Meng Jiao, Ma Dai, and Yao He. He excelled at five-character regulated verse, and painstaking composition became a habit with him. His diction is strikingly unusual and leaves a deep impression; he often depicts desolate, cold, and forlorn scenes and expresses sorrowful, solitary feelings, as in the lines "Walking alone, my reflection at the bottom of the pool; pausing to rest, my body beside the trees" and "The returning clerk seals up the night keys; a moving snake slips into the old tong tree." Such painstakingly wrought lines make up his strange, austere, and chilly style, which leaves an impression of withered stillness and gloom. Yet he also wrote poems that reveal a pure beauty within solitude, and others whose language is plain and natural, whose feeling is sincere and direct, and whose style is bold and vigorous.

Legend has it that while Jia Dao was absorbed in weighing 推 ("push") against 敲 ("knock"), he unknowingly rode straight into Han Yu's procession, and so received instruction in a single character ─── a "one-character teacher". Han Yu reasoned that since "birds roost in the trees by the pond" implies nightfall, to "push" the gate open directly might seem presumptuous; better to "knock" first.

Weighing words, refining phrases, experimenting with combinations and arrangements of imagery: such was Jia Dao's method of "painstaking composition". It is, in essence, one and the same as the "foremost method under heaven" ─── trial and error.

Trial and error

Trial and error is a fundamental method of solving problems.[1] It is characterised by repeated, varied attempts which are continued until success,[2] or until the agent stops trying.

According to W.H. Thorpe, the term was devised by C. Lloyd Morgan after trying out similar phrases “trial and failure” and “trial and practice”.[3] Under Morgan’s Canon, animal behaviour should be explained in the simplest possible way. Where behaviour seems to imply higher mental processes, it might be explained by trial-and-error learning. An example is the skillful way in which his terrier Tony opened the garden gate, easily misunderstood as an insightful act by someone seeing the final behaviour. Lloyd Morgan, however, had watched and recorded the series of approximations by which the dog had gradually learned the response, and could demonstrate that no insight was required to explain it.

Edward Thorndike showed how to manage a trial-and-error experiment in the laboratory. In his famous experiment, a cat was placed in a series of puzzle boxes in order to study the law of effect in learning.[4] He plotted learning curves which recorded the timing for each trial. Thorndike’s key observation was that learning was promoted by positive results, which was later refined and extended by B.F. Skinner‘s operant conditioning.

Trial and error is also a heuristic method of problem solving, repair, tuning, or obtaining knowledge. In the field of computer science, the method is called generate and test. In elementary algebra, when solving equations, it is “guess and check”.
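To make "generate and test" concrete, here is a minimal sketch in Python (purely illustrative, not tied to any particular library): candidate answers are generated one after another and each is tested until one passes.

# Generate and test ("guess and check"): propose candidates in turn and
# keep the first one that satisfies the condition being tested.

def guess_and_check(target_square, candidates):
    """Return the first candidate x with x*x == target_square, or None."""
    for x in candidates:            # generate a candidate
        if x * x == target_square:  # test it
            return x
    return None                     # every attempt failed

print(guess_and_check(1369, range(100)))  # -> 37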

This approach can be seen as one of the two basic approaches to problem solving, contrasted with an approach using insight and theory. However, there are intermediate methods which, for example, use theory to guide the method, an approach known as guided empiricism.

[Figure: a computer graph created by trial and error ("Trial with PC")]

… the gateway. Let us now hear Michael Nielsen expound the spirit of "trial and error":

Using rectified linear units: The network we’ve developed at this point is actually a variant of one of the networks used in the seminal 1998 paper*

*“Gradient-based learning applied to document recognition”, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998). There are many differences of detail, but broadly speaking our network is quite similar to the networks described in the paper.

introducing the MNIST problem, a network known as LeNet-5. It’s a good foundation for further experimentation, and for building up understanding and intuition. In particular, there are many ways we can vary the network in an attempt to improve our results.

As a beginning, let’s change our neurons so that instead of using a sigmoid activation function, we use rectified linear units. That is, we’ll use the activation function f(z) \equiv \max(0, z). We’ll train for 60 epochs, with a learning rate of \eta = 0.03. I also found that it helps a little to use some l2 regularization, with regularization parameter \lambda = 0.1:

>>> from network3 import ReLU
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), 
                      filter_shape=(40, 20, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.03, 
            validation_data, test_data, lmbda=0.1)

I obtained a classification accuracy of 99.23 percent. It’s a modest improvement over the sigmoid results (99.06). However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions. There appears to be a real gain in moving to rectified linear units for this problem.

What makes the rectified linear activation function better than the sigmoid or tanh functions? At present, we have a poor understanding of the answer to this question. Indeed, rectified linear units have only begun to be widely used in the past few years. The reason for that recent adoption is empirical: a few people tried rectified linear units, often on the basis of hunches or heuristic arguments*

*A common justification is that max(0, z) doesn't saturate in the limit of large z, unlike sigmoid neurons, and this helps rectified linear units continue learning. The argument is fine, as far as it goes, but it's hardly a detailed justification, more of a just-so story. Note that we discussed the problems with saturation back in Chapter 2.

They got good results classifying benchmark data sets, and the practice has spread. In an ideal world we’d have a theory telling us which activation function to pick for which application. But at present we’re a long way from such a world. I should not be at all surprised if further major improvements can be obtained by an even better choice of activation function. And I also expect that in coming decades a powerful theory of activation functions will be developed. Today, we still have to rely on poorly understood rules of thumb and experience.
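The saturation argument in the note above is easy to check numerically. Here is a small sketch (plain Python, for illustration only) comparing the gradient of a sigmoid neuron with that of a rectified linear unit as the weighted input z grows:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    # derivative of max(0, z): 1 for z > 0, otherwise 0
    return 1.0 if z > 0 else 0.0

for z in [1.0, 5.0, 10.0, 20.0]:
    print(z, sigmoid_prime(z), relu_prime(z))

# The sigmoid gradient shrinks towards zero as z grows (saturation),
# while the rectified linear unit keeps a gradient of 1 for all positive z.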

Expanding the training data: Another way we may hope to improve our results is by algorithmically expanding the training data. A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel. We can do this by running the program expand_mnist.py from the shell prompt*

*The code for expand_mnist.py is available here:

python expand_mnist.py

Running this program takes the 50,000 MNIST training images, and prepares an expanded training set, with 250,000 training images. We can then use those training images to train our network. We'll use the same network as above, with rectified linear units. In my initial experiments I reduced the number of training epochs - this made sense, since we're training with 5 times as much data. But, in fact, expanding the data turned out to considerably reduce the effect of overfitting. And so, after some experimentation, I eventually went back to training for 60 epochs. In any case, let's train:

>>> expanded_training_data, _, _ = network3.load_data_shared(
        "../data/mnist_expanded.pkl.gz")
>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)

Using the expanded training data I obtained a classification accuracy of 99.37 percent. So this almost trivial change gives a substantial improvement in classification accuracy. Indeed, as we discussed earlier, this idea of algorithmically expanding the data can be taken further.
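For intuition about what the expansion involves, here is a minimal sketch of a one-pixel displacement (written with numpy purely for illustration; it is not the actual expand_mnist.py):

import numpy as np

def shift_image(image, d_row, d_col):
    """Shift a 28x28 image by (d_row, d_col) pixels, padding with zeros."""
    shifted = np.zeros_like(image)
    rows, cols = image.shape
    # slices selecting the overlapping region in source and destination
    src_r = slice(max(0, -d_row), rows - max(0, d_row))
    dst_r = slice(max(0, d_row), rows - max(0, -d_row))
    src_c = slice(max(0, -d_col), cols - max(0, d_col))
    dst_c = slice(max(0, d_col), cols - max(0, -d_col))
    shifted[dst_r, dst_c] = image[src_r, src_c]
    return shifted

image = np.random.rand(28, 28)  # stand-in for a single MNIST image
expanded = [image] + [shift_image(image, dr, dc)
                      for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
# One original plus four one-pixel displacements: five images in all,
# which is how 50,000 training images become 250,000.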
Just to remind you of the flavour of some of the results in that earlier discussion: in 2003 Simard, Steinkraus and Platt*

*"Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis", by Patrice Simard, Dave Steinkraus, and John Platt (2003). http://dx.doi.org/10.1109/ICDAR.2003.1227801

improved their MNIST performance to 99.6 percent using a neural network otherwise very similar to ours, using two convolutional-pooling layers, followed by a hidden fully-connected layer with 100 neurons. There were a few differences of detail in their architecture - they didn't have the advantage of using rectified linear units, for instance - but the key to their improved performance was expanding the training data. They did this by rotating, translating, and skewing the MNIST training images. They also developed a process of "elastic distortion", a way of emulating the random oscillations hand muscles undergo when a person is writing. By combining all these processes they substantially increased the effective size of their training data, and that's how they achieved 99.6 percent accuracy.

Inserting an extra fully-connected layer: Can we do even better? One possibility is to use exactly the same procedure as above, but to expand the size of the fully-connected layer. I tried with 300 and 1,000 neurons, obtaining results of 99.46 and 99.43 percent, respectively. That's interesting, but not really a convincing win over the earlier result (99.37 percent).

What about adding an extra fully-connected layer? Let's try inserting an extra fully-connected layer, so that we have two 100-hidden neuron fully-connected layers:

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        FullyConnectedLayer(n_in=100, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(expanded_training_data, 60, mini_batch_size, 0.03,
            validation_data, test_data, lmbda=0.1)

Doing this, I obtained a test accuracy of 99.43 percent. Again, the expanded net isn't helping so much. Running similar experiments with fully-connected layers containing 300 and 1,000 neurons yields results of 99.48 and 99.47 percent. That's encouraging, but still falls short of a really decisive win.

What's going on here? Is it that the expanded or extra fully-connected layers really don't help with MNIST? Or might it be that our network has the capacity to do better, but we're going about learning the wrong way? For instance, maybe we could use stronger regularization techniques to reduce the tendency to overfit. One possibility is the dropout technique introduced back in Chapter 3. Recall that the basic idea of dropout is to remove individual activations at random while training the network. This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data.
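As a quick reminder of the mechanics, here is a tiny illustrative sketch of removing activations at random (network3.py handles this internally through the p_dropout parameter; this standalone numpy version is only for intuition):

import numpy as np

def dropout(activations, p_dropout=0.5):
    """Zero each activation with probability p_dropout (training time only)."""
    mask = np.random.binomial(1, 1.0 - p_dropout, size=activations.shape)
    return activations * mask

layer_output = np.random.rand(10)  # stand-in for one layer's activations
print(dropout(layer_output))       # roughly half the entries are now zero
# At test time no activations are dropped; the full network is used.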
Let's try applying dropout to the final fully-connected layers:

>>> net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                      filter_shape=(20, 1, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                      filter_shape=(40, 20, 5, 5),
                      poolsize=(2, 2),
                      activation_fn=ReLU),
        FullyConnectedLayer(
            n_in=40*4*4, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        FullyConnectedLayer(
            n_in=1000, n_out=1000, activation_fn=ReLU, p_dropout=0.5),
        SoftmaxLayer(n_in=1000, n_out=10, p_dropout=0.5)],
        mini_batch_size)
>>> net.SGD(expanded_training_data, 40, mini_batch_size, 0.03,
            validation_data, test_data)

Using this, we obtain an accuracy of 99.60 percent, which is a substantial improvement over our earlier results, especially our main benchmark, the network with 100 hidden neurons, where we achieved 99.37 percent.

There are two changes worth noting. First, I reduced the number of training epochs to 40: dropout reduced overfitting, and so we learned faster. Second, the fully-connected hidden layers have 1,000 neurons, not the 100 used earlier. Of course, dropout effectively omits many of the neurons while training, so some expansion is to be expected. In fact, I tried experiments with both 300 and 1,000 hidden neurons, and obtained (very slightly) better validation performance with 1,000 hidden neurons.

Using an ensemble of networks: An easy way to improve performance still further is to create several neural networks, and then get them to vote to determine the best classification. Suppose, for example, that we trained 5 different neural networks using the prescription above, with each achieving accuracies near to 99.6 percent. Even though the networks would all have similar accuracies, they might well make different errors, due to the different random initializations. It's plausible that taking a vote amongst our 5 networks might yield a classification better than any individual network.

This sounds too good to be true, but this kind of ensembling is a common trick with both neural networks and other machine learning techniques. And it does in fact yield further improvements: we end up with 99.67 percent accuracy.
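The voting step itself is simple. Here is a minimal sketch, assuming we already have each trained network's predicted digit for every test image (the names and numbers below are made up for illustration):

from collections import Counter

def ensemble_vote(per_network_predictions):
    """Majority vote across networks: one list of predicted digits per network."""
    votes_per_image = zip(*per_network_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_image]

# Three hypothetical networks, four test images:
predictions = [[7, 2, 1, 0],
               [7, 2, 1, 9],
               [7, 3, 1, 0]]
print(ensemble_vote(predictions))  # -> [7, 2, 1, 0]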
In other words, our ensemble of networks classifies all but 33 of the 10,000 test images correctly.

The remaining errors in the test set are shown below. The label in the top right is the correct classification, according to the MNIST data, while in the bottom right is the label output by our ensemble of nets:

[Image: http://neuralnetworksanddeeplearning.com/images/ensemble_errors.png]

It's worth looking through these in detail. The first two digits, a 6 and a 5, are genuine errors by our ensemble. However, they're also understandable errors, the kind a human could plausibly make. That 6 really does look a lot like a 0, and the 5 looks a lot like a 3. The third image, supposedly an 8, actually looks to me more like a 9. So I'm siding with the network ensemble here: I think it's done a better job than whoever originally drew the digit. On the other hand, the fourth image, the 6, really does seem to be classified badly by our networks.

And so on. In most cases our networks' choices seem at least plausible, and in some cases they've done a better job classifying than the original person did writing the digit. Overall, our networks offer exceptional performance, especially when you consider that they correctly classified 9,967 images which aren't shown. In that context, the few clear errors here seem quite understandable. Even a careful human makes the occasional mistake. And so I expect that only an extremely careful and methodical human would do much better. Our network is getting near to human performance.

───