W!o+ 的《小伶鼬工坊演義》︰神經網絡【超參數評估】五

自然萬物是科學觀察之園地,技術啟發的寶庫。回聲可以定位︰

Animal echolocation

Echolocation, also called bio sonar, is the biological sonar used by several kinds of animals. Echolocating animals emit calls out to the environment and listen to the echoes of those calls that return from various objects near them. They use these echoes to locate and identify the objects. Echolocation is used for navigation and for foraging (or hunting) in various environments. Some blind humans have learned to find their way using clicks produced by a device or by mouth.

Echolocating animals include some mammals and a few birds; most notably microchiropteran bats and odontocetes (toothed whales and dolphins), but also in simpler form in other groups such as shrews, one genus of megachiropteran bats (Rousettus) and two cave dwelling bird groups, the so-called cave swiftlets in the genus Aerodramus (formerly Collocalia) and the unrelated Oilbird Steatornis caripensis.[1]

【圖】A depiction of the ultrasound signals emitted by a bat, and the echo from a nearby object.

 

舞蹈可以傳意︰

蜜蜂舞蹈

蜜蜂舞蹈(英語:waggle dance,即蜜蜂八字形搖擺舞)為用於表達蜜蜂養殖行為中之蜜蜂特定八字形舞蹈(figure-eight dance)的一個術語。通過進行這個舞蹈,成功覓食者可以與群體的其他成員分享有關產生花蜜和花粉的花、水源,或新的巢址位置的方向和距離的信息。[1][2]此一發現出自於奧地利生物學家諾貝爾獎得主卡爾·馮·弗里希在1940年代的研究並翻譯其意義:工蜂在採完花蜜回到蜂巢之後,會進行兩種特別的移動方式。[3]研究對象是一種西方蜜蜂,為卡尼鄂拉蜂。當一隻工蜂回到巢中,其他工蜂會面向她,並以她為中心,就像在觀看這隻蜜蜂跳舞一樣。在發現提出之後經過多年的爭議,最後被大多數生物學家接受,並且成為當代生物學教科書中有關動物行為的經典教材。

【圖】花朵的方向與太陽方向的夾角,等於搖臀直線與地心引力的夾角(α角)。

舞蹈的種類

搖臀舞

蜜蜂跳舞的移動路徑會形成一個8字形。外圍環狀部分稱做回歸區(return phase);中間直線部分稱做搖臀區(waggle phase),搖臀舞(Waggle dance)因此得名。蜜蜂會一邊搖動臀部一邊走過這條直線,搖臀的持續時間表示食物的距離,搖臀時間愈長,表示食物距離愈遠,以75毫秒代表100公尺。而這段直線與地心引力的方向之夾角,代表食物方向與太陽方向的夾角。之後更發現,蜜蜂會因太陽位置的相對移動而修正直線的角度。

環繞舞

環繞舞(Round dance),一開始被分類為另一種舞蹈,是工蜂用來表達蜂巢附近有食物的存在,但無法表達食物的距離與方向。通常使用在發現近距離的食物(距離小於50-60公尺)。然而後來的研究認為環繞舞並非獨立存在,而是搖臀舞的直線部分極短暫的版本。

搖擺舞交流演化

科學家透過觀察發現不同品種的蜜蜂擁有不同舞蹈的「語言」,每個品種或亞種舞蹈的弧度及時間都各有不同[4][5]。一項近期研究顯示在東方蜜蜂與西方蜜蜂共同居住的地區,二者能夠逐漸理解對方舞蹈中的「語言」[6]。

【圖】西方蜜蜂的八字形搖擺舞。搖擺舞進行於垂直蜂巢45°的『向上』方向(A圖);即表示食物來源位於蜂巢(B圖)外之太陽右側45°(α角)向上方向。『舞蹈蜜蜂』的腹部因從一邊快速移動到另一邊故出現些許的模糊影像。

 

若說其是不學而能︰

本能

本能或稱先天行為,是指一個生物體趨向於某一特定行為的內在傾向。本能的最簡單例子就是固定行為模式(FAP),指的是對於一種可清晰界定的刺激(鑰匙刺激),生物體會回應以一系列固定的動作,時間長度由很短到中等。

如果一個行為並非基於以前的經驗(也就是說不是通過學習而來的),它就可以被稱為本能,是內在生物因素的表現。例如,海灘上剛孵化出的海龜會自發地爬向大海。剛出生的有袋類會爬向母親的育兒袋。蜜蜂無需正式的協商就可以通過舞蹈來交流食物源的位置。其他的本能行為還包括動物的打鬥、求偶,逃跑以及築巢。

本能是與生俱來的複雜行為模式,在一個物種的絕大多數成員中都存在。它不應與反射相混淆,後者指的是一個器官對於特定刺激的簡單反應,例如瞳孔在強光下收縮,或者膝跳反射。意志能力的缺失不應被認為意味著無法改變行為模式。例如,人可以有意識地改變行為模式,只要意識到這個行為的存在並且簡單地停止行為,而其他不具備足夠強意志能力的動物一旦開始某個行為就不能停止。[1]

對於不同的物種,本能在行為中所扮演的角色各不相同。一種動物的神經系統越複雜,它的大腦皮質和社會學習所起的作用就越大,而本能的作用則越小。鱷魚與大象之間的比較可以說明,哺乳動物是如何深度依賴社會學習。動物園中長大的母獅子和母猩猩如果在幼年時就離開了母親,常常會出現拒絕撫養其後代的情況,因為它們並未習得養育後代。而這在較簡單的爬行動物身上並未出現。

【圖】棱皮龜的幼體正在向大海爬去

 

那麼所謂『超參數』何謂耶???!!!

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【超參數評估】四

詩經‧國風‧唐風‧鴇羽

肅肅鴇羽,集於苞栩。
王事靡盬,不能蓺稷黍。
父母何怙?悠悠蒼天,曷其有所?

肅肅鴇翼,集於苞棘。
王事靡盬,不能蓺黍稷。
父母何食?悠悠蒼天,曷其有極?

肅肅鴇行,集於苞桑。
王事靡盬,不能蓺稻粱。
父母何嚐?悠悠蒼天,曷其有常?

 

【譯文】

大雁簌簌拍翅膀,成群落在柞樹上。
王室差事做不完,無法去種黍子和高粱。
靠誰養活我爹娘?高高在上的老天爺,何時才能回家鄉?

大雁簌簌展翅飛,成群落在棗樹上。
王室差事做不完,無法去種黍子和高粱。
贍養父母哪有糧?高高在上的老天爺,做到何時才收場?

大雁簌簌飛成行,成群落在桑樹上。
王室差事做不完,無法去種稻子和高粱。
用啥去給父母嚐?高高在上的老天爺,生活何時能正常?

 

善讀書者不只能讀典章書籍,而且能讀天地之文。這首詩為什麼用『鴇鳥』與『樹上』之意象呢?維基百科詞條講︰

鴇(拼音:bǎo,注音:ㄅㄠˇ),學名Otididae,舊名Otidae,是分布於東半球的大型長腿狩獵鳥類,經常出現在乾燥而開闊的大草原。在鳥類分類學中屬於鶴形目鴇科。

鴇為雜食性鳥類,在地面上築巢。

 

知此就知道是借『鴇鳥』之『天性』暗指人世間『反常』之無奈。《論語‧述而》裡,孔子說︰

不憤不啟,不悱不發,舉一隅不以三隅反,則不復也。

 

敘述好學者能『聞一知十』,無心為學則『舉一隅不以三隅反』。

『藝』和『巧』二字,『學而時習』者真積力久自得︰

《説文解字》

埶,種也。从坴、丮,持亟種之。《詩》曰:“我埶黍稷。”

巧,技也。从工,丂聲。

 

還請跟著 Michael Nielsen 的步伐,踏上自得之旅耶!!

Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don’t need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that’s taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
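One concrete way to decide that the validation accuracy has "stopped improving" is a no-improvement-in-n-epochs rule. Below is only a sketch of that idea, not code from network2.py; the callbacks train_one_epoch and validation_accuracy are hypothetical stand-ins for whatever training loop and evaluation you already have.

def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              patience=10, max_epochs=400):
    """Stop once validation accuracy has not improved for `patience` epochs."""
    best_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one epoch of SGD
        acc = validation_accuracy()            # accuracy on held-out data
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch  # still improving
        elif epoch - best_epoch >= patience:
            break                              # no improvement for a while: terminate
    return best_acc, best_epoch

Raising patience trades extra training time for less risk of stopping on a temporary plateau.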

……

Learning rate schedule: We’ve been holding the learning rate η constant. However, it’s often advantageous to vary the learning rate. Early on during the learning process it’s likely that the weights are badly wrong. And so it’s best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
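A simple way to realize such a schedule is to start near the estimated threshold and decay the rate as training proceeds, for example halving it every so many epochs. The helper below is a generic sketch of that idea; the particular numbers are arbitrary choices, not Nielsen's.

def scheduled_eta(initial_eta, epoch, halve_every=10, floor_ratio=1.0 / 128):
    """Halve the learning rate every `halve_every` epochs, but never let it
    fall below a fixed fraction of its initial value."""
    eta = initial_eta / (2.0 ** (epoch // halve_every))
    return max(eta, initial_eta * floor_ratio)

# e.g. scheduled_eta(0.5, 0) == 0.5, while scheduled_eta(0.5, 25) == 0.125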

……

The regularization parameter, \lambda: I suggest starting initially with no regularization (\lambda = 0.0), and determining a value for \eta, as above. Using that choice of \eta, we can then use the validation data to select a good value for \lambda. Start by trialling \lambda = 1.0*

*I don’t have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with λ, I’d appreciate hearing it (mn@michaelnielsen.org).

, and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once you’ve found a good order of magnitude, you can fine tune your value of \lambda. That done, you should return and re-optimize \eta again.
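That coarse-to-fine recipe is mechanical enough to script. The fragment below is a rough sketch, where validation_accuracy(lmbda) is a hypothetical helper that trains a network with the given \lambda (at the already-chosen \eta) and returns its accuracy on the validation data.

def search_lambda(validation_accuracy, start=1.0):
    """Coarse search by factors of 10, then a light fine-tuning pass."""
    # Coarse stage: orders of magnitude around the starting value.
    scores = {l: validation_accuracy(l)
              for l in (start * 10.0 ** k for k in range(-3, 4))}
    best = max(scores, key=scores.get)
    # Fine stage: a few multipliers around the best order of magnitude.
    for m in (0.2, 0.5, 2.0, 5.0):
        scores[best * m] = validation_accuracy(best * m)
    return max(scores, key=scores.get)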

……

How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you’ll find that you get values for \eta and \lambda which don’t always exactly match the values I’ve used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we’ve made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I’ve usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there’s no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I’ve used are something of a compromise.

As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that’d be a better, fairer approach, since then we’d see the best from every approach to learning. However, we’ve made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That’s why I’ve adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.

……

Mini-batch size: How should we set the mini-batch size? To answer this question, let’s first suppose that we’re doing online learning, i.e., that we’re using a mini-batch size of 1.

The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don’t need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It’s as though you are trying to get to the North Magnetic Pole, but have a wonky compass that’s 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you’ll end up at the North Magnetic Pole just fine.
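The "wonky compass" intuition can be checked numerically: averaging m per-example gradients shrinks the noise roughly like 1/\sqrt{m}, while even m = 1 still points the right way on average. Below is a toy NumPy sketch; the "gradient" here is synthetic and has nothing to do with network2.py.

import numpy as np

rng = np.random.default_rng(0)
true_gradient = np.array([1.0, -2.0, 0.5])

def per_example_gradient():
    # Hypothetical single-example gradient: the true gradient plus noise.
    return true_gradient + rng.normal(scale=2.0, size=3)

for m in (1, 10, 100):
    estimates = [np.mean([per_example_gradient() for _ in range(m)], axis=0)
                 for _ in range(200)]
    err = np.mean([np.linalg.norm(e - true_gradient) for e in estimates])
    print("mini-batch size %3d: mean error of the estimate ~ %.2f" % (m, err))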

……

Automated techniques: I’ve been describing these heuristics as though you’re optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* 

*Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012).

by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won’t review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters*

*Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.

. The code from the paper is publicly available, and has been used with some success by other researchers.
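Both grid search and the random search that Bergstra and Bengio advocate are easy to sketch. The fragment below is a generic illustration, not the code from either paper; validation_accuracy(eta, lmbda) is again a hypothetical helper.

import random

def random_search(validation_accuracy, trials=20, seed=0):
    """Sample eta and lambda log-uniformly and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_acc = None, -1.0
    for _ in range(trials):
        eta = 10.0 ** rng.uniform(-3, 1)      # learning rate in [0.001, 10]
        lmbda = 10.0 ** rng.uniform(-2, 2)    # L2 parameter in [0.01, 100]
        acc = validation_accuracy(eta, lmbda)
        if acc > best_acc:
            best_params, best_acc = (eta, lmbda), acc
    return best_params, best_acc

Log-uniform sampling spreads the trials evenly across orders of magnitude, which is exactly where the "factors of 10" reasoning above lives.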

……

Summing up: Following the rules-of-thumb I’ve described won’t give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I’ve discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with \eta, feel that you’ve got it just right, then start to optimize for \lambda, only to find that it’s messing up your optimization for \eta. In practice, it helps to bounce backward and forward, gradually closing in good values. Above all, keep in mind that the heuristics I’ve described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren’t working, and be willing to experiment. In particular, this means carefully monitoring your network’s behaviour, especially the validation accuracy.
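The "bounce backward and forward" advice amounts to a crude coordinate search over the hyper-parameters. As a sketch only, again with a hypothetical validation_accuracy(eta, lmbda) helper, a few rounds of it might look like this:

def bounce(validation_accuracy, eta, lmbda, rounds=3):
    """Alternately re-tune eta and lambda, since a good value of one
    hyper-parameter can shift the best value of the other."""
    for _ in range(rounds):
        _, eta = max((validation_accuracy(e, lmbda), e)
                     for e in (eta / 2, eta, eta * 2))
        _, lmbda = max((validation_accuracy(eta, l), l)
                       for l in (lmbda / 3, lmbda, lmbda * 3))
    return eta, lmbda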

The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper*

*Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012).

that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper*

*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998)

by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets*

*Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.

. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.

One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There’s always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that’s important.

The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I’ve heard many variations on the following complaint: “Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or insert your own favorite technique] and it just works. I don’t have time to figure out just the right neural network.” Of course, from a practical point of view it’s good to have easy-to-apply techniques. This is particularly true when you’re just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.

───

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【超參數評估】三

一段平鋪直敘之文章,寫得清楚容易了解︰

Learning rate: Suppose we run three MNIST networks with three different learning rates, \eta = 0.025, \eta = 0.25 and \eta = 2.5, respectively. We’ll set the other hyper-parameters as for the experiments in earlier sections, running over 30 epochs, with a mini-batch size of 10, and with \lambda = 5.0. We’ll also return to using the full 50,000 training images. Here’s a graph showing the behaviour of the training cost as we train*

*The graph was generated by multiple_eta.py.

With \eta = 0.025 the cost decreases smoothly until the final epoch. With \eta = 0.25 the cost initially decreases, but after about 20 epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations. Finally, with \eta =2.5 the cost makes large oscillations right from the start. To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function,

However, if \eta is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead. That’s likely*

*This picture is helpful, but it’s intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation. Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large η, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour.

what’s causing the cost to oscillate when \eta =2.5. When we choose \eta = 0.25 the initial steps do take us toward a minimum of the cost function, and it’s only once we get near that minimum that we start to suffer from the overshooting problem. And when we choose \eta = 0.025 we don’t suffer from this problem at all during the first 30 epochs. Of course, choosing η so small creates another problem, namely, that it slows down stochastic gradient descent. An even better approach would be to start with \eta = 0.25, train for 20 epochs, and then switch to \eta = 0.025. We’ll discuss such variable learning rate schedules later. For now, though, let’s stick to figuring out how to find a single good value for the learning rate, \eta.

With this picture in mind, we can set \eta as follows. First, we estimate the threshold value for \eta at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn’t need to be too accurate. You can estimate the order of magnitude by starting with \eta = 0.01. If the cost decreases during the first few epochs, then you should successively try \eta = 0.1, 1.0, \ldots until you find a value for \eta where the cost oscillates or increases during the first few epochs. Alternately, if the cost oscillates or increases during the first few epochs when \eta = 0.01, then try \eta = 0.001, 0.0001, \ldots until you find a value for \eta where the cost decreases during the first few epochs. Following this procedure will give us an order of magnitude estimate for the threshold value of \eta. You may optionally refine your estimate, to pick out the largest value of \eta at which the cost decreases during the first few epochs, say \eta = 0.5 or \eta = 0.2 (there’s no need for this to be super-accurate). This gives us an estimate for the threshold value of \eta.
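Written as code, this threshold search is just a pair of multiply-or-divide-by-ten loops. A sketch, assuming a hypothetical cost_decreases(eta) helper that trains for a few epochs at that learning rate and reports whether the training cost fell:

def estimate_threshold_eta(cost_decreases, start=0.01, lo=1e-6, hi=1e3):
    """Order-of-magnitude estimate of the largest eta for which the
    training cost still decreases during the first few epochs."""
    eta = start
    if cost_decreases(eta):
        # Grow eta by factors of 10 until the cost oscillates or increases.
        while eta * 10 <= hi and cost_decreases(eta * 10):
            eta *= 10
    else:
        # Shrink eta by factors of 10 until the cost starts decreasing.
        while eta > lo and not cost_decreases(eta):
            eta /= 10
    return eta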

Obviously, the actual value of \eta that you use should be no larger than the threshold value. In fact, if the value of \eta is to remain usable over many epochs then you likely want to use a value for \eta that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.

In the case of the MNIST data, following this strategy leads to an estimate of 0.1 for the order of magnitude of the threshold value of \eta. After some more refinement, we obtain a threshold value \eta = 0.5. Following the prescription above, this suggests using \eta = 0.25 as our value for the learning rate. In fact, I found that using \eta = 0.5 worked well enough over 30 epochs that for the most part I didn’t worry about using a lower value of \eta.

This all seems quite straightforward. However, using the training cost to pick \eta appears to contradict what I said earlier in this section, namely, that we’d pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we’ll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning the training cost usually only decreases if the validation accuracy improves, and so in practice it’s unlikely to make much difference which criterion you use.

───

 

原本無需註釋,只因『美學偏好』 aesthetic preference 一語,幾句陳述啟人疑竇。比方『剃刀原理』喜歡『簡明理論』,可說是科學之『美學原則』。歐氏幾何學從點、線、面等等『基本概念』,藉著五大『公設』,推演整部幾何學,實開現今『公設法』之先河,可說是一種『美學』的『論述典範』。如是『超參數』居處『神經網絡』之先,必先擇取『破題』方能確立某一『神經網絡模型』,自當是以『驗證正確率』為依歸。不過這『學習率』Learning rate 卻是根植於『梯度下降法』方法論的『內稟參數』,故有此一分說的耶?或許 Michael Nielsen 先生暗指科學上還有一以『邏輯先後』為理據之傳統的乎!好比︰

乾坤萬象自有它之內蘊機理,好似程序能將系統的輸入轉成輸出,或許這正是科學所追求之『萬物理論』的吧??

生︰西方英國有學者,名作『史蒂芬‧沃爾夫勒姆』Stephen Wolfram,創造『Mathematica』,曾寫

《一種新科學》 A New Kind of Science

,分類『細胞自動機』,欲究事物之本原。

Cellular automaton


A cellular automaton (pl. cellular automata, abbrev. CA) is a discrete model studied in computability theory, mathematics, physics, complexity science, theoretical biology and microstructure modeling. Cellular automata are also called cellular spaces, tessellation automata, homogeneous structures, cellular structures, tessellation structures, and iterative arrays.[2]

A cellular automaton consists of a regular grid of cells, each in one of a finite number of states, such as on and off (in contrast to a coupled map lattice). The grid can be in any finite number of dimensions. For each cell, a set of cells called its neighborhood is defined relative to the specified cell. An initial state (time t = 0) is selected by assigning a state for each cell. A new generation is created (advancing t by 1), according to some fixed rule (generally, a mathematical function) that determines the new state of each cell in terms of the current state of the cell and the states of the cells in its neighborhood. Typically, the rule for updating the state of cells is the same for each cell and does not change over time, and is applied to the whole grid simultaneously, though exceptions are known, such as the stochastic cellular automaton and asynchronous cellular automaton.
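The definition above (a grid of cells, a neighbourhood, and one fixed rule applied to every cell simultaneously) fits in a few lines. Here is a minimal one-dimensional sketch using Wolfram's rule 30, chosen purely as an illustration:

def step(cells, rule=30):
    """One synchronous update of a 1-D binary cellular automaton: each
    cell's new state depends only on itself and its two neighbours
    (periodic boundary), exactly as in the definition above."""
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

# A single live cell at time t = 0, then successive generations.
cells = [0] * 31
cells[15] = 1
for _ in range(15):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)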

─── 摘自《M♪o 之學習筆記本《辰》組元︰【䷀】萬象一原

 

如果緣起性空,本原何來?萬象何起??既然一切還原於『原子』 !!

還原論

還原論還原主義(英語:Reductionism,又譯化約論),是一種哲學思想,認為複雜的系統、事務、現象可以通過將其化解為各部分之組合的方法,加以理解和描述。

還原論的思想在自然科學中有很大影響,例如認為化學是以物理學為基礎,生物學是以化學為基礎,等等。在社會科學中,圍繞還原論的觀點有很大爭議,例如心理學是否能夠歸結於生物學社會學是否能歸結於心理學政治學能否歸結於社會學,等等。

───

Reductionism

Reductionism refers to several related but different philosophical positions regarding the connections between phenomena, or theories, “reducing” one to another, usually considered “simpler” or more “basic”.[1] The Oxford Companion to Philosophy suggests that it is “one of the most used and abused terms in the philosophical lexicon” and suggests a three part division:[2]

  1. Ontological reductionism: a belief that the whole of reality consists of a minimal number of parts
  2. Methodological reductionism: the scientific attempt to provide explanation in terms of ever smaller entities
  3. Theory reductionism: the suggestion that a newer theory does not replace or absorb the old, but reduces it to more basic terms. Theory reduction itself is divisible into three: translation, derivation and explanation.[3]

Reductionism can be applied to objects, phenomena, explanations, theories, and meanings.[3][4][5]

In the sciences, application of methodological reductionism attempts explanation of entire systems in terms of their individual, constituent parts and their interactions. Thomas Nagel speaks of psychophysical reductionism (the attempted reduction of psychological phenomena to physics and chemistry), as do others and physico-chemical reductionism (the attempted reduction of biology to physics and chemistry), again as do others.[6] In a very simplified and sometimes contested form, such reductionism is said to imply that a system is nothing but the sum of its parts.[4][7] However, a more nuanced view is that a system is composed entirely of its parts, but the system will have features that none of the parts have.[8] “The point of mechanistic explanations is usually showing how the higher level features arise from the parts.”[7]

Other definitions are used by other authors. For example, what Polkinghorne calls conceptual or epistemological reductionism[4] is the definition provided by Blackburn[9] and by Kim:[10] that form of reductionism concerning a program of replacing the facts or entities entering statements claimed to be true in one area of discourse with other facts or entities from another area, thereby providing a relationship between them. Such a connection is provided where the same idea can be expressed by “levels” of explanation, with higher levels reducible if need be to lower levels. This use of levels of understanding in part expresses our human limitations in grasping a lot of detail. However, “most philosophers would insist that our role in conceptualizing reality [our need for an hierarchy of “levels” of understanding] does not change the fact that different levels of organization in reality do have different properties.”[8]

As this introduction suggests, there are a variety of forms of reductionism, discussed in more detail in subsections below.

Reductionism strongly reflects a certain perspective on causality. In a reductionist framework, the phenomena that can be explained completely in terms of relations between other more fundamental phenomena, are called epiphenomena. Often there is an implication that the epiphenomenon exerts no causal agency on the fundamental phenomena that explain it. The epiphenomena are sometimes said to be “nothing but” the outcome of the workings of the fundamental phenomena, although the epiphenomena might be more clearly and efficiently described in very different terms. There is a tendency to avoid taking an epiphenomenon as being important in its own right. This attitude may extend to cases where the fundamentals are not clearly able to explain the epiphenomena, but are expected to by the speaker. In this way, for example, morality can be deemed to be “nothing but” evolutionary adaptation, and consciousness can be considered “nothing but” the outcome of neurobiological processes.

Reductionism does not preclude the existence of what might be called emergent phenomena, but it does imply the ability to understand those phenomena completely in terms of the processes from which they are composed. This reductionist understanding is very different from emergentism, which intends that what emerges in “emergence” is more than the sum of the processes from which it emerges.[11]

 

【圖】Descartes held that non-human animals could be reductively explained as automata — De homine, 1662.

───

 

焉有

夸克

夸克英語:quark,又譯「層子」或「虧子」)是一種基本粒子,也是構成物質的基本單元。夸克互相結合,形成一種複合粒子,叫強子,強子中最穩定的是質子中子,它們是構成原子核的單元[1]。由於一種叫「夸克禁閉」的現象,夸克不能夠直接被觀測到,或是被分離出來;只能夠在強子裏面找到夸克[2][3]。就是因為這個原因,我們對夸克的所知大都是來自對強子的觀測。

我們知道夸克有六種,夸克的種類被稱為「味」,它們是上、下、奇、魅、頂及底[4]。上及下夸克的質量是所有夸克中最低的。較重的夸克會通過一個叫粒子衰變的過程,來迅速地變成上或下夸克。粒子衰變是一個從高質量態變成低質量態的過程。就是因為這個原因,上及下夸克一般來說很穩定,所以它們在宇宙中很常見,而奇、魅、頂及底則只能經由高能粒子的碰撞產生(例如宇宙射線及粒子加速器)。

夸克有著多種不同的內在特性,包括電荷色荷自旋質量等。在標準模型中,夸克是唯一一種能經受全部四種基本交互作用的基本粒子,基本交互作用有時會被稱為「基本力」(電磁重力強交互作用弱交互作用)。夸克同時是現時已知唯一一種基本電荷整數的粒子。夸克每一種味都有一種對應的反粒子,叫反夸克,它跟夸克的不同之處,只在於它的一些特性跟夸克大小一樣但正負不同

夸克模型分別由默里·蓋爾曼喬治·茨威格於1964年獨立地提出[5]。引入夸克這一概念,是為了能更好地整理各種強子,而當時並沒有甚麼能證實夸克存在的物理證據,直到1968年SLAC開發出深度非彈性散射實驗為止[6][7]。夸克的六種味已經全部被加速器實驗所觀測到;而於1995年在費米實驗室被觀測到的頂夸克,是最後發現的一種[5]

【圖】標準模型中的粒子有六種是夸克(圖中用紫色表示)。左邊的三行中,每一行構成物質的一代。

───

 

耶!!豈是梯度下降法之數學以及歷史淵源

Gradient descent

Euler method

 

不足以解釋『學習率』 \eta 何指的乎??終究還是得回歸『解析』或『數值計算』之審思明辨的吧!!??

Stiff equation

In mathematics, a stiff equation is a differential equation for which certain numerical methods for solving the equation are numerically unstable, unless the step size is taken to be extremely small. It has proven difficult to formulate a precise definition of stiffness, but the main idea is that the equation includes some terms that can lead to rapid variation in the solution.

When integrating a differential equation numerically, one would expect the requisite step size to be relatively small in a region where the solution curve displays much variation and to be relatively large where the solution curve straightens out to approach a line with slope nearly zero. For some problems this is not the case. Sometimes the step size is forced down to an unacceptably small level in a region where the solution curve is very smooth. The phenomenon being exhibited here is known as stiffness. In some cases we may have two different problems with the same solution, yet problem one is not stiff and problem two is stiff. Clearly the phenomenon cannot be a property of the exact solution, since this is the same for both problems, and must be a property of the differential system itself. It is thus appropriate to speak of stiff systems.

Motivating example

Consider the initial value problem

y'(t) = -15\,y(t), \quad t \geq 0, \quad y(0) = 1. \qquad (1)

The exact solution (shown in cyan) is

y(t) = e^{-15t}, \quad \text{with } y(t) \to 0 \text{ as } t \to \infty. \qquad (2)

We seek a numerical solution that exhibits the same behavior.

The figure (right) illustrates the numerical issues for various numerical integrators applied on the equation.

  1. Euler’s method with a step size of h = 1/4 oscillates wildly and quickly exits the range of the graph (shown in red).
  2. Euler’s method with half the step size, h = 1/8, produces a solution within the graph boundaries, but oscillates about zero (shown in green).
  3. The trapezoidal method (that is, the two-stage Adams–Moulton method) is given by
    y_{n+1} = y_{n} + \frac{1}{2} h \left( f(t_{n}, y_{n}) + f(t_{n+1}, y_{n+1}) \right), \qquad (3)

    where y' = f(t, y). Applying this method instead of Euler’s method gives a much better result (blue). The numerical results decrease monotonically to zero, just as the exact solution does.
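The behaviour described in this list is easy to reproduce for y'(t) = -15 y(t), y(0) = 1. Here is a small sketch in plain Python; the trapezoidal step size is an arbitrary choice, and for this linear problem the implicit update can be solved in closed form.

def euler(a, y0, h, steps):
    """Explicit Euler for y' = a*y: y_{n+1} = y_n + h*a*y_n."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] * (1 + h * a))
    return ys

def trapezoidal(a, y0, h, steps):
    """Trapezoidal rule for y' = a*y, solved exactly as
    y_{n+1} = y_n * (1 + h*a/2) / (1 - h*a/2)."""
    growth = (1 + h * a / 2) / (1 - h * a / 2)
    return [y0 * growth ** n for n in range(steps + 1)]

a, y0 = -15.0, 1.0
print("Euler, h=1/4      :", [round(v, 3) for v in euler(a, y0, 0.25, 4)])
print("Euler, h=1/8      :", [round(v, 3) for v in euler(a, y0, 0.125, 8)])
print("Trapezoidal, h=1/8:", [round(v, 3) for v in trapezoidal(a, y0, 0.125, 8)])

With h = 1/4 the Euler factor is 1 - 15/4 = -2.75, so the iterates oscillate and blow up; with h = 1/8 the factor is -0.875, so they oscillate about zero while shrinking; the trapezoidal factor is about 0.032, giving the monotone decay of the exact solution.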

One of the most prominent examples of the stiff ODEs is a system that describes the chemical reaction of Robertson:

  \dot{x} = -0.04\,x + 10^{4}\, y \cdot z
  \dot{y} = 0.04\,x - 10^{4}\, y \cdot z - 3 \cdot 10^{7} y^{2}
  \dot{z} = 3 \cdot 10^{7} y^{2}
  (4)

If one treats this system on a short interval, for example t \in [0, 40], there is no problem in numerical integration. However, if the interval is very large (10^{11}, say), then many standard codes fail to integrate it correctly.

Additional examples are the sets of ODEs resulting from the temporal integration of large chemical reaction mechanisms. Here, the stiffness arises from the coexistence of very slow and very fast reactions. To solve them, the software packages KPP and Autochem can be used.

【圖】Explicit numerical methods exhibiting instability when integrating a stiff ordinary differential equation

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【超參數評估】二

人說孔雀東南飛

孔雀東南飛

漢末建安中,廬江府小吏焦仲卿妻劉氏,為仲卿母所遣,自誓不嫁。其家逼之,乃沒水而死。仲卿聞之,亦自縊于庭樹。時人傷之,為詩云爾。

孔雀東南飛,五里一徘徊。十三能織素,十四學裁衣,十五彈箜篌 ,十六誦詩書,十七為君婦,心中常悲苦。君既為府吏,守節情不移。賤妾留空房,相見長日稀。雞鳴入機織,夜夜不得息,三日斷五疋,大人故嫌遲。非為織作遲,君家婦難為。妾不堪驅使,徒留無所施。便可白公姥,及時相遣歸。府吏得聞之,堂上啟阿母:兒已薄祿相,幸復得此婦。結髮共枕席,黃泉共為友,共事二三年,始爾未為久。女行無偏斜,何意致不厚?阿母謂府吏,何乃太區區 !此婦無禮節,舉動自專由。吾意久懷忿,汝豈得自由。東家有賢女,自名秦羅敷。可憐體無比,阿母為汝求,便可速遣之,遣去慎莫留。府吏長跪告,伏惟啟阿母。今若遣此婦,終老不復取。阿母得聞之,槌床便大怒:小子無所畏,何敢助婦語。吾已失恩義,會不相從許。府吏默無聲,再拜還入戶。舉言謂新婦,哽咽不能語。我自不驅卿,逼迫有阿母。卿但暫還家,吾今且報府。不久當歸還 ,還必相迎取。以此下心意,慎勿違吾語。新婦謂府吏,勿復重紛紜。往昔初陽歲,謝家來貴門。奉事循公姥,進止敢自專,晝夜勤作息,伶娉縈苦辛。謂言無罪過,供養卒大恩。仍更被驅遣,何言復來還。妾有繡腰襦,葳蕤自生光。紅羅複斗帳,四角垂香囊。箱簾六七十,綠碧青絲繩。物物各自異,種種在其中。人賤物亦鄙,不足迎後人。留待作遣施,於今無會因,時時為安慰,久久莫相忘 。雞鳴外欲曙,新婦起嚴妝,著我繡裌裙,事事四五通,足下躡絲履,頭上玳瑁光。腰若流紈素,耳著明月璫。指如削蔥根,口如含朱丹,纖纖作細步,精妙世無雙。上堂謝阿母,母聽去不止。昔作女兒時,生小出野里,本自無教訓,兼愧貴家子。受母錢帛多,不堪母驅使,今日還家去,令母勞家裡。卻與小姑別,淚落連珠子。新婦初來時,小姑始扶床,今日被驅遣,小姑如我長,勤心養公姥 ,好自相扶將。初七及下九,嬉戲莫相忘。出門登車去,涕落百餘行。府吏馬在前,新婦車在後。隱隱何田田,俱會大道口。下馬入車中,低頭共耳語。誓不相隔卿,且暫還家去。吾今且赴府。不久當還歸,誓天不相負。新婦謂府吏:感君區區懷,君既若見錄,不久望君來。君當作磐石,妾當作蒲葦,蒲葦紉如絲,磐石無轉移。我有親父兄,性行暴如雷。恐不任我意,逆以煎我懷。舉手長勞勞 ,二情同依依。入門上家堂,進退無顏儀。十七遣汝嫁,謂言無誓違。汝今無罪過,不迎而自歸。蘭芝慚阿母:兒實無罪過。阿母大悲摧。還家十餘日,縣令遣媒來。云有第三郎,窈窕世無雙。年始十八九,便言多令才。阿母謂阿女:汝可去應之。阿女銜淚答:蘭芝初還時,府吏見丁寧,結誓不別離。今日違情義,恐此事非奇。
自可斷來信,徐徐更謂之。阿母白媒人,貧賤有此女,始適還家門 。不堪吏人婦,豈合令郎君。幸可廣問訊,不得便相許。媒人去數日,尋遣丞請還。誰有蘭家女,丞籍有宦官。云有第五郎,嬌逸未有婚。遣丞為媒人,主簿通語言。直說太守家,有此令郎君。既欲結大義,故遣來貴門。阿母謝媒人:女子先有誓,老姥豈敢言。阿兄得聞之,悵然心中煩。舉言謂阿妹:作計何不量?先嫁得府吏,後嫁得郎君。否泰如天地,足以榮汝身。不嫁義郎體,其住欲何云 。蘭芝仰頭答,理實如兄言。謝家事夫婿,中道還兄門。處分適兄意,那得自任專。雖與府吏要,渠會永無緣。登即相許和,便可作婚姻。媒人下床去,諾諾復爾爾。還部白府君,下官奉使命。言談大有緣。府君得聞之,心中大歡喜。視曆復開書,便利此月內。六合正相應,良吉三十日。今已二十七。卿可去成婚。交語連裝束,絡繹如浮雲。青雀白鵠舫,四角龍子幡。婀娜隨風轉,金車玉作輪 。躑躅青驄馬,流蘇金縷鞍。齋錢三百萬,皆用青絲穿。雜綵三百匹,交廣市鮭珍。從人四五百,鬱鬱登郡門。阿母謂阿女:適得府君書,明日來迎汝。何不作衣裳,莫令事不舉。阿女默無聲,手巾掩口啼,淚落便如瀉。移我琉璃榻,出置前窗下。左手持刀尺,朝成繡裌裙,晚成單羅衫。晻晻日欲暝。愁思出門啼。府吏聞此變,因求假暫歸。右手執綾羅。未至二三里,摧藏馬悲哀。新婦識馬聲 ,躡履相逢迎。悵然遙相望,知是故人來。舉手拍馬鞍,嗟歎使心傷。自君別我後,人事不可量。果不如先願,又非君所詳。我有親父母,逼迫兼弟兄,以我應他人,君還何所望。府吏謂新婦:賀卿得高遷。磐石方且厚,可以卒千年。蒲葦一時紉,便作旦夕間。卿當日勝貴,吾獨向黃泉。新婦謂府吏:何意出此言。同是被逼迫,君爾妾亦然。黃泉下相見,勿違今日言。執手分道去,各各還家門 。生人作死別,恨恨那可論。念與世間辭,千萬不復全。府吏還家去,上堂拜阿母:今日大風寒,寒風摧樹木,嚴霜結庭蘭。兒今日冥冥,令母在後單。故作不良計,勿復怨鬼神。命如南山石,四體康且直。阿母得聞之,零淚應聲落。汝是大家子,仕宦於臺閣。慎勿為婦死,貴賤情何薄。東家有賢女,窈窕艷城郭。阿母為汝求,便復在旦夕。府吏再拜還,長歎空房中,作計乃爾立。轉頭向戶裡 ,漸見愁煎迫。其日牛馬嘶。新婦入青廬。菴菴黃昏後,寂寂人定初。我命絕今日,魂去尸長留。攬裙脫絲履,舉身赴清池。府吏聞此事,心知長別離。徘徊庭樹下,自掛東南枝。兩家求合葬,合葬華山傍。東西值松柏,左右種梧桐。枝枝相覆蓋,葉葉相交通。中有雙飛鳥,自名為鴛鴦。仰頭相向鳴,夜夜達五更。行人駐足聽,
寡婦起傍徨。多謝後世人,戒之慎勿忘。

 

惟因西北有高樓

古詩十九首‧《西北有高樓》之五

西北有高樓,上與浮雲齊。
交疏結綺窗,阿閣三重階。
上有弦歌聲,音響一何悲!
誰能爲此曲,無乃杞梁妻。
清商隨風發,中曲正徘徊。
一彈再三歎,慷慨有餘哀。
不惜歌者苦,但傷知音稀。
願爲雙鴻鵠,奮翅起高飛。

先天西北艮為山,伏羲東南兌是澤。天造地設山澤戀,何事惹得是非怨?誰知運命出後天,西北乾遇巽東南,天風姤起謗言生,文王八卦機緣轉!

Michael Nielsen 先生選擇用滾瓜爛熟之例宣說『廣義策略』︰

Broad strategy: When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem. Let’s look at some strategies you can use if you’re having this kind of trouble.

或因已學『先天易』乎??深曉未知『登高難』耶!!

Suppose, for example, that you’re attacking MNIST for the first time. You start out enthusiastic, but are a little discouraged when your first network fails completely, as in the example above. The way to go is to strip the problem down. Get rid of all the training and validation images except images which are 0s or 1s. Then try to train a network to distinguish 0s from 1s. Not only is that an inherently easier problem than distinguishing all ten digits, it also reduces the amount of training data by 80 percent, speeding up training by a factor of 5. That enables much more rapid experimentation, and so gives you more rapid insight into how to build a good network.

You can further speed up experimentation by stripping your network down to the simplest network likely to do meaningful learning. If you believe a [784, 10] network can likely do better-than-chance classification of MNIST digits, then begin your experimentation with such a network. It’ll be much faster than training a [784, 30, 10] network, and you can build back up to the latter.

You can get another speed up in experimentation by increasing the frequency of monitoring. In network2.py we monitor performance at the end of each training epoch. With 50,000 images per epoch, that means waiting a little while – about ten seconds per epoch, on my laptop, when training a [784, 30, 10] network – before getting feedback on how well the network is learning. Of course, ten seconds isn’t very long, but if you want to trial dozens of hyper-parameter choices it’s annoying, and if you want to trial hundreds or thousands of choices it starts to get debilitating. We can get feedback more quickly by monitoring the validation accuracy more often, say, after every 1,000 training images. Furthermore, instead of using the full 10,000 image validation set to monitor performance, we can get a much faster estimate using just 100 validation images. All that matters is that the network sees enough images to do real learning, and to get a pretty good rough estimate of performance. Of course, our program network2.py doesn’t currently do this kind of monitoring. But as a kludge to achieve a similar effect for the purposes of illustration, we’ll strip down our training data to just the first 1,000 MNIST training images. Let’s try it and see what happens. (To keep the code below simple I haven’t implemented the idea of using only 0 and 1 images. Of course, it can be done with just a few lines of code.)

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 1000.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100

Epoch 1 training complete
Accuracy on evaluation data: 10 / 100

Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
...

We’re still getting pure noise! But there’s a big win: we’re now getting feedback in a fraction of a second, rather than once every ten seconds or so. That means you can more quickly experiment with other choices of hyper-parameter, or even conduct experiments trialling many different choices of hyper-parameter nearly simultaneously.

In the above example I left \lambda as \lambda =1000.0, as we used earlier. But since we changed the number of training examples we should really change \lambda to keep the weight decay the same. That means changing \lambda to 20.0. If we do that then this is what happens:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 12 / 100

Epoch 1 training complete
Accuracy on evaluation data: 14 / 100

Epoch 2 training complete
Accuracy on evaluation data: 25 / 100

Epoch 3 training complete
Accuracy on evaluation data: 18 / 100
...

 

Ahah! We have a signal. Not a terribly good signal, but a signal nonetheless. That’s something we can build on, modifying the hyper-parameters to try to get further improvement. Maybe we guess that our learning rate needs to be higher. (As you perhaps realize, that’s a silly guess, for reasons we’ll discuss shortly, but please bear with me.) So to test our guess we try dialing \eta up to 100.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 100.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100

Epoch 1 training complete
Accuracy on evaluation data: 10 / 100

Epoch 2 training complete
Accuracy on evaluation data: 10 / 100

Epoch 3 training complete
Accuracy on evaluation data: 10 / 100

...

 

That’s no good! It suggests that our guess was wrong, and the problem wasn’t that the learning rate was too low. So instead we try dialing \eta down to \eta = 1.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 1.0, lmbda = 20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 62 / 100

Epoch 1 training complete
Accuracy on evaluation data: 42 / 100

Epoch 2 training complete
Accuracy on evaluation data: 43 / 100

Epoch 3 training complete
Accuracy on evaluation data: 61 / 100

...

 

That’s better! And so we can continue, individually adjusting each hyper-parameter, gradually improving performance. Once we’ve explored to find an improved value for η, then we move on to find a good value for \lambda. Then experiment with a more complex architecture, say a network with 10 hidden neurons. Then adjust the values for \eta and \lambda again. Then increase to 20 hidden neurons. And then adjust other hyper-parameters some more. And so on, at each stage evaluating performance using our held-out validation data, and using those evaluations to find better and better hyper-parameters. As we do so, it typically takes longer to witness the impact due to modifications of the hyper-parameters, and so we can gradually decrease the frequency of monitoring.

This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that’s learning nothing. You can tweak hyper-parameters for days, and still get no meaningful response. And so I’d like to re-emphasize that during the early stages you should make sure you can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you’ve got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.

Okay, that’s the broad strategy. Let’s now look at some specific recommendations for setting hyper-parameters. I will focus on the learning rate, \eta, the L2 regularization parameter, \lambda, and the mini-batch size. However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we’ll meet later in the book, such as the momentum co-efficient.

───

 

設若以西遊記

第六十八回 朱紫國唐僧論前世 孫行者施為三折肱

朕西牛賀洲朱紫國王,自立業以來,四方平服,百姓清安。近因國事不祥,沉痾伏枕,淹延日久難痊。本國太醫院屢選良方,未能調治。今出此榜文,普招天下賢士。不拘北往東來,中華外國,若有精醫藥者,請登寶殿,療理朕躬。稍得病愈,願將社稷平分,決不虛示。為此出給張掛。須至榜者。

覽畢,滿心歡喜道:「古人云:『行動有三分財氣。』早是不在館中獃坐。即此不必買甚調和,且把取經事寧耐一日,等老孫做個醫生耍耍。」

為範,欲用『神經網絡』行孫悟空把脈之事 ︰

切診

切診,包括脈診按診兩部分,是醫生運用雙手對病人的一定部位進行觸、摸、按壓,從而了解疾病情況的方法。脈診是按脈搏按診是對病人的肌膚、手及其病變部位的觸摸按壓,以測知局部冷熱、軟硬、壓痛、包塊或其他異常的變化,從而推斷疾病的部位和性質的一種診察方法。

常見病脈

一般來說,一個健康人的脈象應為呼吸之間跳動四次,寸關尺三部之脈和緩有力,不浮不沉。常見的病脈,主要有浮、沉、數、遲、虛、實、滑、洪、細、弦等。

  • 浮脈:浮的意思是脈位浮於表面,輕按可得,重按則減。表證由於外感病邪停留於表,因此脈氣鼓動於外,脈位淺顯。浮而有力則表實;浮而無力則表虛。虛陽外浮,脈浮大無力為危證。
  • 沉脈:跟浮脈相反,沉脈脈位輕按而不得。主要為裡證。跟浮脈相似,有力為裡實,無力則為裡虛。
  • 數脈:數脈跟遲脈相反,意即脈象跳動頻密,每分鐘跳動九十次以上。主為熱證,有力為實熱,無力為虛熱
  • 遲脈:遲的意思,是脈頻跳動遲緩,平均每分鐘跳動六十次以下。主病為寒證,有力為實寒,無力為虛寒
  • 虛脈:寸關尺三部脈皆無力,重按則空虛。主為虛證。
  • 實脈:寸關尺三部脈象皆有力,主為實證。
  • 滑脈:脈象滑如走珠,按之流利為滑脈,乃健康氣血充實之表徵。滑數之象,則為喜脈
  • 洪脈:洪脈,就是如洪水一般的意思。脈大而有力,波濤洶湧,來盛去衰。主為熱盛。
  • 細脈:脈細小如線,起落明顯為細脈,主為虛證。
  • 弦脈:脈按之如按琴弦。主為病、痛證、飲。

───

隨縁居

郭琛 2011-06-12, 德國

二十八脈

浮脈類:浮、洪、濡、散、芤、革。浮脈類的波形如下:脈象分舉取、尋取、按取的結構,此類特徵是脈搏在舉取時遠較尋取、按取明顯。

【圖】浮脈類之波形

 

將如之何哉???

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【超參數評估】一

禮記‧學記卷十八

記問之學,不足以為人師。必也其聽語乎。力不能問,然後語之;語之而不知,雖舍之可也。

良冶之子,必學為裘;良弓之子,必學為箕;始駕馬者反之,車在馬前。君子察於此三者,可以有志於學矣。

古之學者,比物醜類。鼓無當於五聲,五聲弗得不和。水無當於五色,五色弗得不章。學無當於五官,五官弗得不治。師無當於五服,五服弗得不親。

君子曰:「大德不官,大道不器,大信不約,大時不齊。」察於此四者,可以有志於本矣。

三王之祭川也。皆先河而後海;或源也,或委也。此之謂務本。

一字千金

唐太宗‧聖教序

蓋聞二儀有像,顯覆載以含生;四時無形,潛寒暑以化物。是以窺天鑒地,庸愚皆識其端;明陰洞陽,賢哲罕窮其數。然而天地苞乎陰陽而易識者,以其有像也;陰陽處乎天地而難窮者,以其無形也。故知像顯可征,雖愚不惑;形潛莫睹,在智猶迷。況乎佛道崇虛,乘幽控寂,弘濟萬品,典禦十方,舉威靈而無上,抑神力而無下。大之則彌於宇宙,細之則攝於毫厘。無滅無生,曆千劫而不古;若隱若顯,運百福而長今。妙道凝玄,遵之莫知其際;法流湛寂,挹之莫測其源。故知蠢蠢凡愚,區區庸鄙,投其旨趣,能無疑惑者哉!

然 則大教之興,基乎西土,騰漢庭而皎夢,照東域而流慈。昔者,分形分跡之時,言未馳而成化;當常現常之世,民仰德而知遵。及乎晦影歸真,遷儀越世,金容掩 色,不鏡三千之光;麗象開圖,空端四八之相。於是微言廣被,拯含類於三塗;遺訓遐宣,導群生於十地。然而真教難仰,莫能一其旨歸,曲學易遵,邪正於焉紛 糾。所以空有之論,或習俗而是非;大小之乘,乍沿時而隆替。

玄奘法師者, 法門之領袖也。幼懷貞敏,早悟三空之心;長契神情,先苞四忍之行。松風水月,未足比其清華;仙露明珠,詎能方其朗潤。故以智通無累,神測未形,超六塵而迥 出,只千古而無對。凝心內境,悲正法之陵遲;棲慮玄門,慨深文之訛謬。思欲分條析理,廣彼前聞,截偽續真,開茲後學。是以翹心淨土,往遊西域。乘危遠邁, 杖策孤征。積雪晨飛,途閑失地;驚砂夕起,空外迷天。萬裏山川,撥煙霞而進影;百重寒暑,躡霜雨(別本有作「雪」者)而前蹤。誠重勞輕,求深願達,周遊西 宇,十有七年。窮曆道邦,詢求正教,雙林八水,味道餐風,鹿苑鷲峰,瞻奇仰異。承至言於先聖,受真教於上賢,探賾妙門,精窮奧業。一乘五律之道,馳驟於心 田;八藏三篋之文,波濤於口海。

爰 自所曆之國,總將三藏要文,凡六百五十七部,譯布中夏,宣揚勝業。引慈雲於西極,注法雨於東垂,聖教缺而複全,蒼生罪而還福。濕火宅之幹焰,共拔迷途;朗 愛水之昏波,同臻彼岸。是知惡因業墜,善以緣升,升墜之端,惟人所托。譬夫桂生高嶺,雲露方得泫其華;蓮出淥波,飛塵不能汙其葉。非蓮性自潔而桂質本貞, 良由所附者高,則微物不能累;所憑者淨,則濁類不能沾。夫以卉木無知,猶資善而成善,況乎人倫有識,不緣慶而求慶!方冀茲經流施,將日月而無窮;斯福遐 敷,與乾坤而永大。朕才謝珪璋。言慚博達。至於內典。尤所未閑。昨制序文。深為鄙拙。唯恐穢翰墨於金簡。標瓦礫於珠林。忽得來書。謬承褒贊。循躬省慮。彌 蓋厚顏。善不足稱,空勞致謝。皇帝在春宮述三藏。聖記。

夫 顯揚正教,非智無以廣其文。崇闡微言。非賢莫能定其旨。蓋真如聖教者。諸法之玄宗。眾經之軌(足屬)也。綜括宏遠。奧旨遐深。極空有之精微。體生滅之機 要。詞茂道曠。尋之者不究其源。文顯義幽。履之者莫測其際。故知聖慈所被。業無善而不臻。妙化所敷。緣無惡而不翦。開法網之綱紀。弘六度之正教。拯群有之 塗炭。啟三藏之秘扃。是以名無翼而長飛。道無根而永固。道名流慶。曆遂古而鎮常。赴感應身。經塵劫而不朽。晨鐘夕梵。交二音於鷲峰。慧日法流。轉雙輪於鹿 菀。排空寶蓋。接翔雲而共飛。莊野春林。與天花而合彩。

伏 惟皇帝陛下。上玄資福。垂拱而治八荒。德被黔黎。斂衽而朝萬國。恩加朽骨。石室歸貝葉之文。澤及昆蟲。金匱流梵說之偈。遂使阿(禾辱)達水。通神旬之八 川。耆阇崛山。接嵩華之翠嶺。竊以性德凝寂。麋歸心而不通。智地玄奧。感懇誠而遂顯。豈謂重昏之夜。燭慧炬之光。火宅之朝。降法雨之澤。於是百川異流。同 會於海。萬區分義。總成乎實。豈與湯武校其優劣。堯舜比其聖德者哉。玄奘法師者。夙懷聰令。立志夷簡。神清齠齔之年。體拔浮華之世。凝情定室。匿跡幽巖。 棲息三禪。巡遊十地,超六塵之境。獨步迦維。會一乘之旨。隨機化物。以中華之無質。尋印度之真文。遠涉恒河。終期滿字。頻登雪嶺。更獲半珠。問道法還。十 有七載。備通釋典。利物為心。以貞觀十九年九月六日奉。

敕於弘福寺。翻譯聖教要文凡六百五十七部。引大海之法流。洗塵勞而不竭。傳智燈之長焰。皎幽闇而恒明。自非久值勝緣。何以顯揚斯旨。所謂法相常住。齊三光之明。

我皇福臻。同二儀之固。伏見禦制。眾經論序。照古騰今。理含金石之聲。文抱風雲之潤。治輒以輕塵足嶽。墜露添流。略舉大綱。以為斯記。

治素無才學。性不聰敏。內典諸文。殊未觀覽。所作論序。鄙拙尤繁。忽見來書。褒揚贊述。撫躬自省。慚悚交並。勞師等遠臻。深以為愧。

貞觀廿二年八月三日內府。

『記問』之學,非但不足以『為師』,或許也不可『師法』。為什麼呢?『事實』與『資料』無論所積所聚多麼『龐大』且『詳實』,並不能夠自動產生『科學理論』!一個『科學理論』將那『龐大』『事實』和『詳實』『資料』貫串聯繫起來,才能形成『體系』,方可用來說明『萬象』!!

也就是說『知識』之『網絡』往往是『縱橫』『聯繫』的,『概念』的『經緯』常常會『上下』『貫串』。

據聞當初 阿隆佐‧邱奇 Alonzo Church 用『λ 運算』研究『可計算性』問題時,並不知道它自身就是一個『世界上最小的通用性程式語言』。因為『函式』與『變元』兩者是任何人不管想用哪種『□□程式語言』來寫『演算法』Algorithm 都需要清楚理解的『概念』。抽象精巧正是為什麼,讀過『λ 運算』的人,多半覺得它既『難懂』又『難解』。這是有原因的,如果用『抽象辦法』談論著『抽象事物』,又不知道為何如此表述當然『難懂』;假使不能『困思勉行』多次的『深思熟慮』,以至於能夠一旦了悟那就自然『難解』。通常越是『基本』的概念,由於太過『直覺』了,反而容易『誤解』。就像化學元素『週期表』上的元素不過一一八個,它所構成的世界卻是千嬌萬媚繁多複雜,要講『鐵』的『性質』與『作用』,也許一大本書都不能窮盡,但換個方向說鐵不就是日用之物的嗎?

─── 摘自《W!o 的派生‧十日談之《二》

 

也許因記問之學難成大器乎??所以 Michael Nielsen 先生作此難題連連之文︰

How to choose a neural network’s hyper-parameters?

Up until now I haven’t explained how I’ve been choosing values for hyper-parameters such as the learning rate, η, the regularization parameter, λ, and so on. I’ve just been supplying values which work pretty well. In practice, when you’re using neural nets to attack a problem, it can be difficult to find good hyper-parameters. Imagine, for example, that we’ve just been introduced to the MNIST problem, and have begun working on it, knowing nothing at all about what hyper-parameters to use. Let’s suppose that by good fortune in our first experiments we choose many of the hyper-parameters in the same way as was done earlier this chapter: 30 hidden neurons, a mini-batch size of 10, training for 30 epochs using the cross-entropy. But we choose a learning rate η=10.0 and regularization parameter λ=1000.0. Here’s what I saw on one such run:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 10.0, lmbda = 1000.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 1030 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 990 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 1009 / 10000

...

Epoch 27 training complete
Accuracy on evaluation data: 1009 / 10000

Epoch 28 training complete
Accuracy on evaluation data: 983 / 10000

Epoch 29 training complete
Accuracy on evaluation data: 967 / 10000

Our classification accuracies are no better than chance! Our network is acting as a random noise generator!

“Well, that’s easy to fix,” you might say, “just decrease the learning rate and regularization hyper-parameters”. Unfortunately, you don’t a priori know those are the hyper-parameters you need to adjust. Maybe the real problem is that our 30 hidden neuron network will never work well, no matter how the other hyper-parameters are chosen? Maybe we really need at least 100 hidden neurons? Or 300 hidden neurons? Or multiple hidden layers? Or a different approach to encoding the output? Maybe our network is learning, but we need to train for more epochs? Maybe the mini-batches are too small? Maybe we’d do better switching back to the quadratic cost function? Maybe we need to try a different approach to weight initialization? And so on, on and on and on. It’s easy to feel lost in hyper-parameter space. This can be particularly frustrating if your network is very large, or uses a lot of training data, since you may train for hours or days or weeks, only to get no result. If the situation persists, it damages your confidence. Maybe neural networks are the wrong approach to your problem? Maybe you should quit your job and take up beekeeping?

In this section I explain some heuristics which can be used to set the hyper-parameters in a neural network. The goal is to help you develop a workflow that enables you to do a pretty good job setting hyper-parameters. Of course, I won’t cover everything about hyper-parameter optimization. That’s a huge subject, and it’s not, in any case, a problem that is ever completely solved, nor is there universal agreement amongst practitioners on the right strategies to use. There’s always one more trick you can try to eke out a bit more performance from your network. But the heuristics in this section should get you started.

───

 

或當思由於人人『尺』尺『寸』寸

《說文解字》

寸,十分也。人手卻一寸,動脈,謂之寸口。从又,从一。凡寸之屬皆从寸。

尺,十寸也。人手卻十分動脈爲寸口。十寸爲尺。尺,所以指尺規榘事也。从尸,从乙。乙,所識也。周制,寸、尺、咫、 尋、常、仞諸度量,皆以 人之體爲法。凡尺之屬皆从尺。

 

每每不同,所以如此把脈定象耶!!

是否能仿效取法自身『身』經絡『經 絡』呢??!!

如果說有一門研究『夢的科學』稱作夢學 Oneirology,為什麼直到今天『經絡』還徘徊在『主流科學圈』之外?難道『針灸』真能以其『無用性』就可存在了數千年??但是『理論』之『實效性』並不能夠『釋疑』,就彷彿『熱力學』的『熵』,直到玻爾茲曼用著『統計力學』來『定義』,它的『意義』或許方被釐清!!所以在那還沒有『系統理論』的古早之前,就有《黃帝內經》之『經絡系統』的『五臟六腑』體系論述,實在是『很可疑』!這樣看來《周公解夢》也就是解『非理性之夢』的了!!

Meridians 【經絡】in acupuncture 【針灸】and infrared imaging
Shui-yin Lo

Summary The meridians in acupuncture are hypothesized to be made up of polarized molecules. Quantum excitations, quasi-particles and others are assumed to be the media of communication between different parts of the body connected by meridians. Infrared pictures are taken to depict the effect of acupuncture on one acupoint of a meridian to a far away pain area.

HYPOTHESIS
Acupuncture has been around for many thousands of years in China and has achieved good results in both man and animals. It has also recently begun to gain wide acceptance in the West. However, despite many scientific studies, it has still failed to achieve the recognition 【承認】it needs within mainstream orthodox scientific circles. Many studies over the past 40 years have shown that electric conductivity 【導電性】on acupuncture points (1–4) is lower than that on neighboring points. One of the most recent studies has been carried out using functional magnetic resonance imaging (fMRI)【功能性磁振造影】; it has reported the correlation between vision acupoints in the foot and corresponding brain cortices 【皮層】. When acupuncture stimulation is performed on a vision-related acupoint (located on the lateral aspect of the foot), fMRI shows activation of the occipital lobes 【枕葉】. Stimulation of the eye using direct light results in similar activation in the occipital lobes when visualized by fMRI.

Two main questions need to be answered in a modern scientific way:

1. What are meridians?

2. What is the qi 【氣】that is supposed to circulate around the meridians?

The theory behind acupuncture is that the body has a system of meridians which channel 【形成河道】some kind of substance, energy, or information that has been vaguely called qi in the literature. Unfortunately, so far, when one dissects 【解剖】the human body, one does not find any substance that distinguishes the meridians from their surrounding tissues, quite unlike other human systems such as the nerve system or the blood system. Therefore the most likely explanation is that meridians are made up of same ordinary molecules that make up other living materials surrounding them with the exception that they are more ordered. These ordered molecules are neutral but electrically polarized. This provides the natural explanation on the concept of the balance of yin 【陰】and yang 【陽】in Chinese medicine as the neutralization 【中和】of negative and positive charges in electricity. Our hypothesis is then as follows:

The meridians are made up of electrically polarized molecules. On the meridians there are quantum phenomena 【量子現象】such as excitations, quasi-particles 【準粒子】, etc. that account for significant properties of meridians. These polarized molecules line up their polarity to form bigger clusters 【叢集】. Specifically, they are most likely water molecules 【水分子】that group together to form water clusters, which have permanent electric dipole moment. These water clusters then line up together to form the meridians. It has been suggested for a long time that water plays a very active role in the living state of the human body.

……

當□□的理論,用○○的話語來講,『意義』是否會『改變』?『現象』自然就『道遷』??『理化』當真能『不同』???天經地義的該是︰

真理不分東西,學問沒有國界!!科學不在人種!!!

回歸到︰

事實』是『決疑』之依據,『實驗』是『決疑』的方法。

─── 摘自《M♪o 之 TinyIoT ︰ 《承轉》之《決疑‧上》!!