W!o+'s 《小伶鼬工坊演義》 (Tales of the Little Weasel Workshop): Neural Networks [Hyper-parameter Evaluation] IV

Classic of Poetry, Airs of the States, Odes of Tang: "Bustards' Plumes" (詩經‧國風‧唐風‧鴇羽)

Swish, swish beat the bustards' plumes, as they settle on the clustered oaks.
The king's business knows no respite; we cannot plant our millets.
On whom shall our parents lean? O far-off azure Heaven, when shall we find a place of rest?

Swish, swish beat the bustards' wings, as they settle on the clustered thorns.
The king's business knows no respite; we cannot plant our millets.
What shall our parents eat? O far-off azure Heaven, when shall it ever end?

Swish, swish go the bustards' ranks, as they settle on the clustered mulberries.
The king's business knows no respite; we cannot plant our rice and millet.
What shall our parents taste? O far-off azure Heaven, when shall things be right again?

 

[Vernacular translation]

Swish, swish the bustards beat their wings, flocking down onto the oaks.
The king's corvée never ends; there is no time to plant millet and sorghum.
Who will provide for my father and mother? O Heaven high above, when may I return home?

Swish, swish the bustards spread their wings, flocking down onto the thorn trees.
The king's corvée never ends; there is no time to plant millet and sorghum.
What grain is left to feed my parents? O Heaven high above, when will this ever end?

Swish, swish the bustards fly in ranks, flocking down onto the mulberries.
The king's corvée never ends; there is no time to plant rice and sorghum.
What can I offer my parents to taste? O Heaven high above, when will life be normal again?

 

A good reader reads not only the classics but also the text of heaven and earth. Why does this poem use the imagery of 'bustards' perched 'in trees'? The Wikipedia entry explains:

Bustards (鴇, pinyin: bǎo; zhuyin: ㄅㄠˇ), scientific name Otididae (formerly Otidae), are large, long-legged game birds of the Eastern Hemisphere, most often found on dry, open grasslands. In avian taxonomy they form the bustard family within the order Gruiformes.

Bustards are omnivorous birds and nest on the ground.

 

Knowing this, one sees that the poem borrows the bustard's 'nature' (a ground-nesting bird has no business roosting in trees) to express helplessness at the 'abnormal' state of human affairs. In the 'Shu Er' chapter of the Analects, Confucius says:

I never enlighten those who are not straining to understand, nor prompt those who are not struggling to speak. If I hold up one corner and a student cannot come back with the other three, I do not repeat the lesson.

 

This describes how the eager learner can 'hear one point and infer ten,' while one without the heart for learning, 'shown one corner, cannot infer the other three.'

The two characters 藝 (skill, whose old form 埶 means 'to plant') and 巧 (craft) come, for those who 'learn and practice at due times,' from genuine accumulation over long effort:

《説文解字》 (Shuowen Jiezi)

埶: to plant. The graph combines 坴 (a mound of earth) and 丮 (a hand holding), depicting someone holding a seedling and planting it with urgency. The Odes say: '我埶黍稷' ('I plant my millet').

巧: skill. From 工, with 丂 as the phonetic.

 

Now please follow in Michael Nielsen's footsteps and set out on your own journey of self-attained learning!

Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don’t need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that’s taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
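
To make the rule concrete, here is a minimal sketch in Python of the 'no-improvement-in-n-epochs' form of early stopping. The helpers train_one_epoch and validation_accuracy are hypothetical stand-ins (not from network2.py or any particular library) for one epoch of stochastic gradient descent and for evaluation on the validation data.

    def train_with_early_stopping(net, eta, lmbda, patience=10, max_epochs=400):
        """Train until validation accuracy fails to improve for `patience` epochs."""
        best_accuracy, best_epoch = 0.0, 0
        for epoch in range(max_epochs):
            train_one_epoch(net, eta, lmbda)      # hypothetical helper: one epoch of SGD
            accuracy = validation_accuracy(net)   # hypothetical helper: validation accuracy
            if accuracy > best_accuracy:
                best_accuracy, best_epoch = accuracy, epoch
            elif epoch - best_epoch >= patience:
                break                             # accuracy has stopped improving: terminate
        return best_accuracy

A patience of ten epochs is the 'no-improvement-in-ten' rule; during early experimentation you might set patience very high, effectively turning early stopping off, so that any overfitting stays visible.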

……

Learning rate schedule: We’ve been holding the learning rate η constant. However, it’s often advantageous to vary the learning rate. Early on during the learning process it’s likely that the weights are badly wrong. And so it’s best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
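
One simple schedule in this spirit, sketched below with the same hypothetical helpers as the early-stopping example, is to hold \eta fixed until the validation accuracy stalls, then halve it, terminating once \eta has fallen to a small fraction (here, 1/128) of its starting value. The specific constants are illustrative, not prescriptive.

    def train_with_lr_schedule(net, eta0, lmbda, patience=10, floor=128):
        """Halve eta whenever validation accuracy stalls; stop at eta0/floor."""
        eta, best_accuracy, stalled = eta0, 0.0, 0
        while eta >= eta0 / floor:
            train_one_epoch(net, eta, lmbda)      # hypothetical helper
            accuracy = validation_accuracy(net)   # hypothetical helper
            if accuracy > best_accuracy:
                best_accuracy, stalled = accuracy, 0
            else:
                stalled += 1
                if stalled >= patience:
                    eta, stalled = eta / 2.0, 0   # smaller steps for fine-tuned adjustments
        return best_accuracy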

……

The regularization parameter, \lambda: I suggest starting with no regularization (\lambda = 0.0) and determining a value for \eta, as above. Using that choice of \eta, we can then use the validation data to select a good value for \lambda. Start by trialling \lambda = 1.0*, and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine-tune your value of \lambda. That done, you should return and re-optimize \eta again.

*I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with \lambda, I'd appreciate hearing it (mn@michaelnielsen.org).
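
As a sketch of this procedure, the loop below trials \lambda across several orders of magnitude and keeps the value with the best validation accuracy; one would then fine-tune around the winner and re-optimize \eta. It reuses train_with_early_stopping and validation_accuracy from the sketches above, plus a hypothetical net_factory that builds a freshly initialized network for each trial.

    def coarse_lambda_search(net_factory, eta, lmbdas=(0.01, 0.1, 1.0, 10.0, 100.0)):
        """Return the order of magnitude for lambda that does best on validation data."""
        results = {}
        for lmbda in lmbdas:
            net = net_factory()                   # hypothetical helper: fresh network
            train_with_early_stopping(net, eta, lmbda)
            results[lmbda] = validation_accuracy(net)
        return max(results, key=results.get)      # lambda with highest validation accuracy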

……

How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you’ll find that you get values for \eta and \lambda which don’t always exactly match the values I’ve used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we’ve made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I’ve usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there’s no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I’ve used are something of a compromise.

As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that’d be a better, fairer approach, since then we’d see the best from every approach to learning. However, we’ve made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That’s why I’ve adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.

……

Mini-batch size: How should we set the mini-batch size? To answer this question, let’s first suppose that we’re doing online learning, i.e., that we’re using a mini-batch size of 1.

The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don’t need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It’s as though you are trying to get to the North Magnetic Pole, but have a wonky compass that’s 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you’ll end up at the North Magnetic Pole just fine.
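
The compass analogy is easy to check numerically. The toy example below (an illustration of mine, not from the book) runs gradient descent on the quadratic cost C(w) = ||w||^2, corrupting every gradient estimate with zero-mean noise far larger than the gradient itself near the minimum; the iterates still settle close to the optimum at the origin.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([5.0, -3.0])   # start well away from the minimum at the origin
    eta = 0.05

    for step in range(2000):
        true_grad = 2 * w                                        # gradient of C(w) = ||w||^2
        noisy_grad = true_grad + rng.normal(scale=2.0, size=2)   # "wonky compass" estimate
        w -= eta * noisy_grad                                    # right on average is enough

    print(w)   # ends up close to [0, 0] despite the per-step noise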

……

Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters**. The code from the paper is publicly available, and has been used with some success by other researchers.

*Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012).

**Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams (2012).
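
A minimal sketch of random search in the Bergstra-Bengio spirit: rather than stepping through a fixed grid, sample each hyper-parameter log-uniformly over a plausible range. The helper names are the same hypothetical stand-ins used in the earlier sketches, and the sampling ranges are illustrative assumptions.

    import random

    def random_search(net_factory, trials=20):
        """Sample (eta, lambda) pairs log-uniformly; keep the best on validation data."""
        best_accuracy, best_params = 0.0, None
        for _ in range(trials):
            eta = 10 ** random.uniform(-3, 1)     # eta between 0.001 and 10
            lmbda = 10 ** random.uniform(-3, 2)   # lambda between 0.001 and 100
            net = net_factory()                   # hypothetical helper: fresh network
            train_with_early_stopping(net, eta, lmbda)
            accuracy = validation_accuracy(net)   # hypothetical helper
            if accuracy > best_accuracy:
                best_accuracy, best_params = accuracy, (eta, lmbda)
        return best_accuracy, best_params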

……

Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with \eta, feel that you've got it just right, then start to optimize for \lambda, only to find that it's messing up your optimization for \eta. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.

The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper* that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper** by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets***. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.

*Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012).

**Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

***Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller (2012).

One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There’s always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that’s important.

The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I’ve heard many variations on the following complaint: “Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or insert your own favorite technique] and it just works. I don’t have time to figure out just the right neural network.” Of course, from a practical point of view it’s good to have easy-to-apply techniques. This is particularly true when you’re just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.

───