W!o+ 的《小伶鼬工坊演義》︰神經網絡【轉折點】五

《莊子‧外篇》
達生

仲尼適楚,出於林中,見痀僂者承蜩,猶掇之也。仲尼曰:「子巧乎?有道邪?」曰:「我有道也。五六月累丸,二而不墜,則失者錙銖;累三而不墜,則失者十一;累五而不墜,猶掇之也。吾處身也若厥株拘,吾執臂也若槁木之枝,雖天地之大,萬物之多,而唯蜩翼之知。吾不反不側,不以萬物易蜩之翼,何為而不得!」孔子顧謂弟子曰:「用志不分,乃凝於神,其痀僂丈人之謂乎!」

顏淵問仲尼曰:「吾嘗濟乎觴深之淵,津人操舟若神。吾問焉,曰:『操舟可學邪?』曰:『可。善游者數能。若乃夫沒人,則未嘗見舟而便操之也。』吾問焉而不吾告,敢問何謂也?」仲尼曰:「善游者數能,忘水也。若乃夫沒人之未嘗見舟而便操之也,彼視淵若陵,視舟之覆猶其車卻也。覆卻萬方陳乎前而不得入其舍,惡往而不暇!以瓦注者巧,以鉤注者憚,以黃金注者殙。其巧一也,而有所矜,則重外也。凡外重者內拙。」

梓慶削木為鐻,鐻成,見者驚猶鬼神。魯侯見而問焉,曰:「子何術以為焉?」對曰:「臣工人,何術之有!雖然,有一焉。臣將為鐻,未嘗敢以耗氣也,必齊以靜心。齊三日,而不敢懷慶賞爵祿;齊五日,不敢懷非譽巧拙;齊七日,輒然忘吾有四枝形體也。當是時也,無公朝,其巧專而外骨消;然後入山林,觀天性;形軀至矣 ,然後成見鐻,然後加手焉;不然則已。則以天合天,器之所以疑神者,其是與?」

 

痀僂志一處凝神、津人忘水無所矜、梓慶齋心以天合皆是『技』可通『道』者也。方法萬千,主題則一。其神只能神會,其巧只能自得,其器只能己鑄。故無言可說,但請讀 Michael Nielsen 先生之文自然能了︰

Other techniques for regularization

There are many regularization techniques other than L2 regularization. In fact, so many techniques have been developed that I can’t possibly summarize them all. In this section I briefly describe three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size. We won’t go into nearly as much depth studying these techniques as we did earlier. Instead, the purpose is to get familiar with the main ideas, and to appreciate something of the diversity of regularization techniques available.
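這裡先用一段極簡的 numpy 草圖粗略對照這三種想法;以下並非 Nielsen 先生書中 network2.py 之實作,eta、lmbda、n、p 等名稱與數值皆為假設,僅供示意:

# -*- coding: utf-8 -*-
# 假設性示意:L1 正則化、dropout、人工擴充訓練集(非原書程式碼)
import numpy as np

eta, lmbda, n = 0.5, 5.0, 50000          # 學習率、正則化係數、訓練樣本數(假設值)

# (1) L1 正則化:權重更新以 sign(w) 收縮,對照 L2 之以 w 本身收縮
w = np.random.randn(30, 784)
grad_w = np.random.randn(*w.shape)       # 假想的梯度 ∂C0/∂w
w_L1 = w - eta * (lmbda / n) * np.sign(w) - eta * grad_w
w_L2 = w * (1 - eta * lmbda / n) - eta * grad_w

# (2) dropout:訓練時以機率 p 保留神經元,並除以 p 做尺度補償
p = 0.5
a = np.random.rand(30, 1)                # 某隱藏層的假想激活值
mask = (np.random.rand(*a.shape) < p) / p
a_dropped = a * mask

# (3) 人工擴充訓練集:把 28x28 圖像平移一個像素,即得「新」樣本
img = np.random.rand(28, 28)             # 以隨機影像代替 MNIST 樣本
img_shifted = np.roll(img, 1, axis=1)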

……

Summing up: We’ve now completed our dive into overfitting and regularization. Of course, we’ll return again to the issue. As I’ve mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there’s a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.

───


端午節

撒灰除蟲是禳毒驅疫的習俗

端午除蟲

 

《夏小正》五月

參則見。參也者,伐星也,故盡其辭也。

浮游有殷。殷,眾也。浮游,殷之時也。浮游者,渠略也,朝生而莫死。稱「有」,何也?有見也。

鴃則鳴。鴃者,百鷯也。鳴者,相命也。其不辜之時也,是善之,故盡其辭也。

時有養日。養,長也。一則在本,一則在末,故其記曰「時養日」云也。

乃瓜。乃者,急瓜之辭也。瓜也者,始食瓜也。

良蜩鳴。良蜩也者,五采具。

匽之興,五日翕,望乃伏。其不言「生」而稱「興」,何也?不知其生之時,故曰「興」。以其興也,故言之「興」。五日翕也。望也者,月之望也。而伏云者,不知其死也,故謂之「伏」。五日也者,十五日也。翕也者,合也。伏也者,入而不見也。

啟灌藍蓼。啟者,別也,陶而疏之也。灌也者,聚生者也。記時也。

鳩為鷹。

唐蜩鳴。唐蜩者,匽也。

初昏大火中。大火者,心也。心中,種黍、菽、糜時也。

煮梅。為豆實也。

蓄蘭。為沐浴也。

菽糜。以在經中,又言之時,何也?是食矩關而記之。

頒馬。分夫婦之駒也。

將閒諸則。或取離駒納之法則也。


W!o+ 的《小伶鼬工坊演義》︰神經網絡【轉折點】四下

《呂氏春秋‧慎行論》
疑似

使人大迷惑者,必物之相似也。玉人之所患,患石之似玉者;相劍者之所患,患劍之似吳干者;賢主之所患,患人之博聞辯言而似通者。亡國之主似智,亡國之臣似忠。相似之物,此愚者之所大惑,而聖人之所加慮也。故墨子見歧道而哭之。

周宅酆鎬近戎人,與諸侯約,為高葆禱於王路,置鼓其上,遠近相聞。即戎寇至,傳鼓相告,諸侯之兵皆至救天子。戎寇當至,幽王擊鼓,諸侯之兵皆至,褒姒大說,喜之。幽王欲褒姒之笑也,因數擊鼓,諸侯之兵數至而無寇。至於後戎寇真至,幽王擊鼓,諸侯兵不至。幽王之身,乃死於麗山之下,為天下笑。此夫以無寇失真寇者也。賢者有小惡以致大惡。褒姒之敗,乃令幽王好小說以致大滅 。故形骸相離,三公九卿出走,此褒姒之所用死,而平王所以東徙也,秦襄、晉文之所以勞王勞而賜地也。

梁北有黎丘部,有奇鬼焉,喜效人之子姪昆弟之狀。邑丈人有之市而醉歸者,黎丘之鬼效其子之狀,扶而道苦之。丈人歸,酒醒而誚其子,曰:「吾為汝父也,豈謂不慈哉?我醉,汝道苦我,何故? 」其子泣而觸地曰:「孽矣!無此事也。昔也往責於東邑人可問也 。」其父信之,曰:「譆!是必夫奇鬼也,我固嘗聞之矣。」明日端復飲於市,欲遇而刺殺之。明旦之市而醉,其真子恐其父之不能反也,遂逝迎之。丈人望其真子,拔劍而刺之。丈人智惑於似其子者,而殺於真子。夫惑於似士者而失於真士,此黎丘丈人之智也。疑似之跡,不可不察。察之必於其人也。舜為御,堯為左,禹為右 ,入於澤而問牧童,入於水而問漁師,奚故也?其知之審也。夫人子之相似者,其母常識之,知之審也。

 

Michael Nielsen 先生之讚嘆與期許

This is particularly galling because in everyday life, we humans generalize phenomenally well. Shown just a few images of an elephant a child will quickly learn to recognize other elephants. Of course, they may occasionally make mistakes, perhaps confusing a rhinoceros for an elephant, but in general this process works remarkably accurately. So we have a system – the human brain – with a huge number of free parameters. And after being shown just one or a few training images that system learns to generalize to other images. Our brains are, in some sense, regularizing amazingly well! How do we do it? At this point we don’t know. I expect that in years to come we will develop more powerful techniques for regularization in artificial neural networks, techniques that will ultimately enable neural nets to generalize well even from small data sets.

 

雖文猶在目,苦於難以親見也!遂生『疑惑』焉?那『神經網絡』當真可擬似『人腦』乎??『正則化』 regularization 果然能比類『一般化』耶!!故特用『雜訊』欲一管窺豹,不過『疑似』迷惑更大的哩!!??

pi@raspberrypi:~/test/neural-networks-and-deep-learning/src $ python
Python 2.7.9 (default, Mar  8 2015, 00:52:26) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> npzfile = network.np.load("swb.npz")
>>> npzfile.files
['s', 'b2', 'w2', 'w1', 'b1']
>>> net.weights[0] = npzfile["w1"]
>>> net.weights[1] = npzfile["w2"]
>>> net.biases[0] = npzfile["b1"]
>>> net.biases[1] = npzfile["b2"]
>>> net.evaluate(test_data=test_data)
9474
>>> import matplotlib.pyplot as plt
# 【五原圖】
>>> img = training_data[0][0].reshape(28,28)
>>> plt.imshow(img,cmap='Greys', interpolation='nearest')
<matplotlib.image.AxesImage object at 0x74018290>
>>> plt.show()
>>> imgc = img.reshape(784,1)
>>> network.np.argmax(net.feedforward(imgc))
5

# 【加 0.2 雜訊】
>>> imgn02 = img + 0.2 * network.np.random.random(img.shape)
>>> plt.imshow(imgn02,cmap='Greys', interpolation='nearest')
<matplotlib.image.AxesImage object at 0x741cd370>
>>> plt.show()
>>> imgn02c = imgn02.reshape(784,1)
>>> network.np.argmax(net.feedforward(imgn02c))
5

# 【加 0.8 雜訊】
>>> imgn08 = img + 0.8 * network.np.random.random(img.shape)
>>> plt.imshow(imgn08,cmap='Greys', interpolation='nearest')
<matplotlib.image.AxesImage object at 0x73ff5470>
>>> plt.show()
>>> imgn08c = imgn08.reshape(784,1)
>>> network.np.argmax(net.feedforward(imgn08c))
5

# 【加 1.0 雜訊】
>>> imgn10 = img + 1.0 * network.np.random.random(img.shape)
>>> plt.imshow(imgn10,cmap='Greys', interpolation='nearest')
<matplotlib.image.AxesImage object at 0x6e53b810>
>>> plt.show()
>>> imgn10c = imgn10.reshape(784,1)
>>> network.np.argmax(net.feedforward(imgn10c))
3

# 【加 1.2 雜訊】
>>> imgn12 = img + 1.2 * network.np.random.random(img.shape)
>>> plt.imshow(imgn12,cmap='Greys', interpolation='nearest')
<matplotlib.image.AxesImage object at 0x6e57abb0>
>>> plt.show()
>>> imgn12c = imgn12.reshape(784,1)
>>> network.np.argmax(net.feedforward(imgn12c))
5
>>> 

 

【五原圖】辨識為 5

HW5-origin

 

【加 0.2 雜訊】辨識為 5

HW5-02

 

【加 0.8 雜訊】辨識為 5

HW5-08

 

【加 1.0 雜訊】辨識為 3

HW5-10

 

【加 1.2 雜訊】辨識為 5

HW5-12

 

能相信『雜訊』也有『顏色』的嗎?之所以會得到這個結果是因為『numpy.random.random』有顏色?是一種『連續型均勻分布』的乎??如果用『numpy.random.normal』製造『白雜訊』就能真相大白的耶??!!
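此處不妨附上一個假設性的小試驗草圖(沿用前面對話裡已載入的 net 與 img,sigma 為任意假設值):numpy.random.random 產生的是 [0,1) 上的均勻亂數,平均值約 0.5,等於替整張圖加上一個直流偏移;numpy.random.normal(0, sigma) 則是零平均的高斯雜訊,較接近下文所謂的『白雜訊』:

# -*- coding: utf-8 -*-
# 假設性草圖:比較均勻雜訊與零平均高斯雜訊(沿用前面 REPL 的 net 與 img)
import numpy as np

sigma = 0.3                                        # 假設的雜訊強度

noise_uniform = 0.8 * np.random.random(img.shape)  # 均勻雜訊,平均值約 0.4(非零)
noise_gauss = np.random.normal(0.0, sigma, img.shape)  # 零平均高斯(白)雜訊

print("uniform mean: %.3f" % noise_uniform.mean())  # 約 0.4,整張圖被「抬亮」
print("gauss   mean: %.3f" % noise_gauss.mean())    # 約 0,只在像素間起伏

for noisy in (img + noise_uniform, img + noise_gauss):
    x = np.clip(noisy, 0.0, 1.0).reshape(784, 1)    # 夾回 [0,1] 再送入網路
    print(np.argmax(net.feedforward(x)))            # 觀察辨識結果是否仍為 5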

白雜訊

白雜訊,是一種功率譜密度為常數的隨機訊號或隨機過程。即,此訊號在各個頻段上的功率是一樣的。由於白光是由各種頻率(顏色)的單色光混合而成,因而此訊號的這種具有平坦功率譜的性質被稱作是「白色的」,此訊號也因此被稱作白雜訊。相對的,其他不具有這一性質的雜訊訊號被稱為有色雜訊。

理想的白雜訊具有無限頻寬,因而其能量是無限大,這在現實世界是不可能存在的。實際上,我們常常將有限頻寬的平整訊號視為白雜訊,以方便進行數學分析。

White_noise_spectrum

白雜訊功率譜

統計特性

術語白雜訊也常用於表示在相關空間的自相關為0的空域雜訊訊號,於是訊號在空間頻率域內就是「白色」的,對於角頻率域內的訊號也是這樣,例如夜空中向各個角度發散的訊號。右面的圖片顯示了計算機產生的一個有限長度的離散時間白雜訊過程。

需要指出,相關性和機率分布是兩個不相關的概念。「白色」僅意味著訊號是不相關的,白雜訊的定義除了要求均值為零外並沒有對訊號應當服從哪種機率分布作出任何假設。因此,如果某白雜訊過程服從高斯分布,則它是「高斯白雜訊」。類似的,還有泊松白雜訊、柯西白雜訊等。人們經常將高斯白雜訊與白雜訊相混同,這是不正確的認識。根據中心極限定理,高斯白雜訊是許多現實世界過程的一個很好的近似,並且能夠生成數學上可以跟蹤的模型,這些模型用得如此頻繁以至於加性高斯白雜訊成了一個標準的縮寫詞: AWGN。此外,高斯白雜訊有著非常有用的統計學特性,因為高斯變量的獨立性與不相關性等價

白雜訊是維納過程或者布朗運動的廣義均方導數(generalized mean-square derivative)。

白雜訊的數學期望為0:

\mu_n = E\{n(t)\} = 0

自相關函數為狄拉克 δ 函數:

r_{nn}(\tau) = E\{n(t)\, n(t - \tau)\} = \delta(\tau)

上式正是對白雜訊的「白色」性質在時域的描述。由於隨機過程的功率譜密度是其自相關函數的傅立葉變換,而δ函數的傅立葉變換為常數,因此白雜訊的功率譜密度是平坦的。
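這段敘述可以用一小段假設性的 numpy 草圖來體會(樣本數 N 為任意假設):產生零平均的高斯白雜訊,估計其自相關與週期圖,前者近似只在零延遲處有值,後者平均而言大致平坦:

# -*- coding: utf-8 -*-
# 假設性示意:有限長度樣本下,「自相關近似 δ、功率譜近似平坦」
import numpy as np

N = 4096
w = np.random.normal(0.0, 1.0, N)          # 零平均、單位變異數的高斯白雜訊

r = np.correlate(w, w, mode='full') / N    # 自相關估計
lags = np.arange(-N + 1, N)
print("r(0)  ~ %.3f" % r[lags == 0][0])    # 約 1(零延遲)
print("r(10) ~ %.3f" % r[lags == 10][0])   # 約 0(非零延遲)

psd = np.abs(np.fft.rfft(w)) ** 2 / N      # 週期圖(功率譜估計)
print("PSD mean ~ %.3f" % psd.mean())      # 平均約為 1,即大致平坦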

Noise

頻譜圖上顯示的左邊的粉紅雜訊和右邊的白雜訊

……

數學定義

白色隨機向量

一個隨機向量 w 為一個白色隨機向量若且唯若它的平均值函數與自相關函數滿足以下條件:

\mu_w = E\{\mathbf{w}\} = 0
R_{ww} = E\{\mathbf{w} \mathbf{w}^T\} = \sigma^2 I

意即它是一個平均值為零的隨機向量,並且它的自相關函數是單位矩陣的倍數。

白色隨機過程(白雜訊)

一個時間連續隨機過程 w(t),t ∈ ℝ,為一個白雜訊,若且唯若它的平均值函數與自相關函數滿足以下條件:

\mu_w(t) = E\{w(t)\} = 0
R_{ww}(t_1, t_2) = E\{w(t_1)\, w(t_2)\} = (N_0/2)\, \delta(t_1 - t_2)

意即它是一個對所有時間其平均值為零的隨機過程,並且它的自相關函數是狄拉克δ函數,有無限大的功率。

由上述自相關函數可推出以下的功率譜密度。

S_{xx}(\omega) = N_0/2

由於 δ 函數的傅立葉變換為 1,對於所有頻率來說,此功率譜密度都是一樣的。因此這是對白雜訊之「白色」性質在頻域的表述。

───

 

此『隨機』之事作者因沒有研究故不知,所以也只能王顧左右而言他的了︰

在現今的世界裡,『搞哲學的』是個貶抑的形容詞,多年前,東方大哲唐君毅先生說︰讀哲學,以後沒飯吃!而西方大哲康德先生,則流傳著一個故事︰話說鄰家女子喜歡康德,請父親替她提婚,康德回答讓他想想,用著哲學的辦法東想西想想了一年,當最後願娶時,伊人早已別嫁。但是避得開哲學,卻躲不了哲學問題,現今的宇宙論大霹靂學說,把『時間』的難題又向前推進一步,這個學說談著 0^{+} 以後的事。那 0 秒之時、0^{-} 秒之前呢?0^{-} 那時根本沒有『時間』這回事,又怎麽談『有』或『無』呢?『有始』帶來麻煩,『無始』一樣難解,試問︰果然無窮的過去,又怎麽能到現在?所以說『有』與『無』是否能畫『界』?或者能談此『似有若無』的『介面』?大哉問?或許帶著酒興的詩人崔護,偶過長安城南郊,……此事略能捕捉到這個『有無交接』之處,並『創造』出了如此出色的一首詩︰

題都城南莊

去年今日此門中,

人面桃花相映紅。

人面不知何處去,

桃花依舊笑春風。

難道我們應該認為『創造力』能夠『無中生有』?【易繫辭】上說,『易有太極』……,或許當是『從有生有』;其中第二章專講『制器尚象』,讓我們聽聽古人怎麼說︰

古者包犧氏之王天下也,仰則觀象于天,俯則觀法于地,觀鳥獸之文,與地之宜。近取諸身,遠取諸物,於是始作八卦,以通神明之德,以類萬物之情。作結繩而為網罟,以佃以漁,蓋取諸離 ☲☲。包犧氏沒,神農氏作。斲木為耜,揉木為耒,耒耨之利,以教天下,蓋取諸益 ☴☳。日中為市,致天下之民,聚天下之貨,交易而退,各得其所,蓋取諸噬嗑 ☲☳。神農氏沒,黃帝堯舜氏作。通其變使民不倦,神而化之使民宜之。易窮則變,變則通,通則久,是以自天祐之,吉无不利。黃帝堯舜,垂衣裳而天下治,蓋取諸乾 ☰☰ 坤 ☷☷。刳木為舟,剡木為楫,舟楫之利,以濟不通,致遠以利天下,蓋取諸渙 ☴☵。服牛乘馬,引重致遠,以利天下,蓋取諸隨 ☱☳。重門擊柝,以待暴客,蓋取諸豫 ☳☷。斷木為杵,掘地為臼,臼杵之利,萬民以濟,蓋取諸小過 ☳☶。弦木為弧,剡木為矢,弧矢之利,以威天下,蓋取諸睽 ☲☱。上古穴居而野處,後世聖人易之以宮室,上棟下宇,以待風雨,蓋取諸大壯 ☳☰。古之葬者,厚衣之以薪,葬之中野,不封不樹,喪期无數。後世聖人,易之以棺槨,蓋取諸大過 ☱☴。上古結繩而治,後世聖人易之以書契,百官以治,萬民以察,蓋取諸夬 ☱☰。

綜觀全文,好一部以『生活』為中心的工具發明器物應用,貫串了成千上萬年的歷史。在此我們並不打算解讀著這些『蓋取諸…』之卦,讀者可以試著自己『想象』。為什麼是『十三』個卦呢?古來『陰陽』合曆,一個太陽年 365.25 天,而以月亮為主的太陰年約為 354 天,這就是農曆『閏月』的由來,閏月的那一年有十三個月,作【易繫辭】者或想暗示著『春生夏長秋收冬藏』一個『大年』已經完成了?『觀象繫辭』是一個古老的傳統,中國的『文』『字』本身就承載著它;東漢許慎在『說文解字』上講的『象形、指示、會意、形聲、轉注、假借』,其實可以分成『擬象』── 象形、指示、會意 ──,『音象』── 形聲 ──,『用象』── 轉注、假借 ──,都是『事物』之象。『象』的本意是『想』起來看,比方說︰懸象著明,莫大於『日』『月』;對於『沒有形狀』的事物,也用著象之的方法,舉例說︰『日』可見,天行健的『健』是『日陽』的『德性』,不可見,所以象之為『☰』。然後有『乾』為天,君子終日乾乾 ── 健健 ── 的諸種種。不僅如此,於器物、建築、用品、…等等『實物』也常常有『象』意,比方說北京的『紫禁城』就是依據風水之法,象著『九五至尊』而建造的。由此我們也許可以把『觀象』之法,看作人類心靈中『聯想法』的精煉。

─── 摘自《制器尚象,恆其道。


W!o+ 的《小伶鼬工坊演義》︰神經網絡【轉折點】四中

《呂氏春秋‧慎行論》
察傳

夫得言不可以不察,數傳而白為黑,黑為白。故狗似玃,玃似母猴 ,母猴似人,人之與狗則遠矣。此愚者之所以大過也。聞而審則為福矣,聞而不審,不若無聞矣。齊桓公聞管子於鮑叔,楚莊聞孫叔敖於沈尹筮,審之也,故國霸諸侯也。吳王聞越王句踐於太宰嚭,智伯聞趙襄子於張武,不審也,故國亡身死也。

凡聞言必熟論,其於人必驗之以理。魯哀公問於孔子曰:「樂正夔一足,信乎?」孔子曰:「昔者舜欲以樂傳教於天下,乃令重黎舉夔於草莽之中而進之,舜以為樂正。夔於是正六律,和五聲,以通八風,而天下大服。重黎又欲益求人,舜曰:『夫樂,天地之精也,得失之節也,故唯聖人為能和。樂之本也。夔能和之,以平天下。若夔者一而足矣。』故曰夔一足,非一足也。」宋之丁氏,家無井而出溉汲,常一人居外。及其家穿井,告人曰:「吾穿井得一人。」有聞而傳之者曰:「丁氏穿井得一人。」國人道之,聞之於宋君,宋君令人問之於丁氏,丁氏對曰:「得一人之使,非得一人於井中也。」求聞之若此,不若無聞也。

子夏之晉,過衛,有讀史記者曰:「晉師三豕涉河。」子夏曰:「非也,是己亥也。夫『己』與『三』相近,『豕』與『亥』相似 。」至於晉而問之,則曰「晉師己亥涉河」也。辭多類非而是,多類是而非。是非之經,不可不分,此聖人之所慎也。然則何以慎?緣物之情及人之情以為所聞則得之矣。

 

古早的中國喜用『類比』來論事說理,難道這就是『科學』不興的原因嗎?李約瑟在其大著《中國的科學與文明》試圖解決這個今稱『李約瑟難題』之大哉問!終究還是百家爭鳴也?若是比喻的說︰一個孤立隔絕系統之演化,常因內部機制的折衝協調,周遭環境之影響相對的小很多。因此秦之『大一統』,歷代的『戰亂』頻起,能不達於『社會』之『平衡』的耶??如此『主流價值』亦是已然確立成為『文化內涵』的吧!!所以『天不變』、『道不變』,人亦『不變』乎!!??雖然李約瑟曾經明示『類比』── 關聯式思考 correlative thinking ── 難以建立完整的『邏輯體系』,或是『科學』不興的理由耶??!!如果『自然事物』之『邏輯推理』能形成系統『大樹』,那麼『類比關聯』將創造體系『森林』矣,豈可不『慎察』也。

類比(英語:Analogy,源自古希臘語 ἀναλογία,analogia,意為「等比例的」),或類推,是一種認知過程,將某個特定事物所附帶的訊息轉移到其他特定事物之上。類比通過比較兩件事情,清楚揭示二者之間的相似點,並將已知事物的特點,推衍到未知事物中,但兩者不一定有實質上的同源性,其類比也不見得「合理」。在記憶、溝通與問題解決等過程中扮演重要角色;於不同學科中也有各自的定義。

舉例而言,原子中的原子核以及由電子組成的軌域,可類比成太陽系中行星環繞太陽的樣子。除此之外,修辭學中的譬喻法有時也是一種類比,例如將月亮比喻成銀幣。生物學中因趨同演化而形成的同功或同型解剖構造,例如哺乳類、爬行類、鳥類的翅膀,也是類似概念。

───

Analogy

Analogy (from Greek ἀναλογία, analogia, “proportion”[1][2]) is a cognitive process of transferring information or meaning from a particular subject (the analogue or source) to another (the target), or a linguistic expression corresponding to such a process. In a narrower sense, analogy is an inference or an argument from one particular to another particular, as opposed to deduction, induction, and abduction, where at least one of the premises or the conclusion is general. The word analogy can also refer to the relation between the source and the target themselves, which is often, though not necessarily, a similarity, as in the biological notion of analogy.

Analogy plays a significant role in problem solving such as, decision making, perception, memory, creativity, emotion, explanation, and communication. It lies behind basic tasks such as the identification of places, objects and people, for example, in face perception and facial recognition systems. It has been argued that analogy is “the core of cognition”.[3] Specific analogical language comprises exemplification, comparisons, metaphors, similes, allegories, and parables, but not metonymy. Phrases like and so on, and the like, as if, and the very word like also rely on an analogical understanding by the receiver of a message including them. Analogy is important not only in ordinary language and common sense (where proverbs and idioms give many examples of its application) but also in science, philosophy, and the humanities. The concepts of association, comparison, correspondence, mathematical and morphological homology, homomorphism, iconicity, isomorphism, metaphor, resemblance, and similarity are closely related to analogy. In cognitive linguistics, the notion of conceptual metaphor may be equivalent to that of analogy.

Analogy has been studied and discussed since classical antiquity by philosophers, scientists, and lawyers. The last few decades have shown a renewed interest in analogy, most notably in cognitive science.

420px-Bohr_atom_model_English.svg

Rutherford’s model of the atom (modified by Niels Bohr) made an analogy between the atom and the solar system.

───

 

將如何了解 Michael Nielsen 先生所言之『雜訊』的呢?

Noise is a variety of sound, usually meaning any unwanted sound.

Noise may also refer to:

Random or unwanted signals

 

假使藉著『取樣原理』將 MNIST 『手寫阿拉伯數字』看成『函式』

Whittaker–Shannon interpolation formula

The Whittaker–Shannon interpolation formula or sinc interpolation is a method to construct a continuous-time bandlimited function from a sequence of real numbers. The formula dates back to the works of E. Borel in 1898, and E. T. Whittaker in 1915, and was cited from works of J. M. Whittaker in 1935, and in the formulation of the Nyquist–Shannon sampling theorem by Claude Shannon in 1949. It is also commonly called Shannon’s interpolation formula and Whittaker’s interpolation formula. E. T. Whittaker, who published it in 1915, called it the Cardinal series.

Definition

Given a sequence of real numbers, x[n], the continuous function

x(t) = \sum_{n=-\infty}^{\infty} x[n] \, {\rm sinc}\left(\frac{t - nT}{T}\right)\,

(where “sinc” denotes the normalized sinc function) has a Fourier transform, X(f), whose non-zero values are confined to the region |f| ≤ 1/(2T).  When parameter T has units of seconds, the bandlimit, 1/(2T), has units of cycles/sec (hertz). When the x[n] sequence represents time samples, at interval T, of a continuous function, the quantity fs = 1/T is known as the sample rate, and fs/2 is the corresponding Nyquist frequency. When the sampled function has a bandlimit, B, less than the Nyquist frequency, x(t) is a perfect reconstruction of the original function. (See Sampling theorem.) Otherwise, the frequency components above the Nyquist frequency “fold” into the sub-Nyquist region of X(f), resulting in distortion. (See Aliasing.)

240px-Bandlimited.svg

Fourier transform of a bandlimited function.

─── 摘自《勇闖新世界︰ W!o《卡夫卡村》變形祭︰品味科學‧教具教材‧【專題】 PD‧箱子世界‧取樣
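若想具體感受這條內插公式,可用下列假設性的 numpy 草圖(T、f0、樣本範圍皆為任意假設值):np.sinc 即上式的正規化 sinc,以有限長的樣本和近似雙邊無窮和,在非取樣點上重建並檢驗帶限訊號:

# -*- coding: utf-8 -*-
# 假設性草圖:x(t) = Σ x[n] sinc((t - nT)/T)
import numpy as np

T = 0.1                                  # 取樣間隔(假設值),帶限 1/(2T) = 5 Hz
f0 = 2.0                                 # 測試訊號頻率,低於 Nyquist 頻率
n = np.arange(-200, 201)                 # 有限長度樣本,近似雙邊無窮和
x_n = np.sin(2 * np.pi * f0 * n * T)     # 樣本序列 x[n]

def reconstruct(t):
    # 依 Whittaker-Shannon 公式在任意時刻 t 重建 x(t)
    return np.sum(x_n * np.sinc((t - n * T) / T))

t = 0.237                                # 任意非取樣點
print("reconstructed: %.6f" % reconstruct(t))
print("exact        : %.6f" % np.sin(2 * np.pi * f0 * t))  # 兩者應十分接近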

 

自然可藉著

變分法

變分法是處理泛函的數學領域,和處理函數的普通微積分相對。譬如,這樣的泛函可以通過未知函數的積分和它的導數來構造。變分法最終尋求的是極值函數:它們使得泛函取得極大或極小值。有些曲線上的經典問題採用這種形式表達:一個例子是最速降線,在重力作用下一個粒子沿著該路徑可以在最短時間從點 A 到達不直接在它底下的一點 B。在所有從 A 到 B 的曲線中必須極小化代表下降時間的表達式。

變分法的關鍵定理是歐拉-拉格朗日方程。它對應於泛函的臨界點。在尋找函數的極大和極小值時,在一個解附近的微小變化的分析給出一階的一個近似。它不能分辨是找到了最大值或者最小值(或者都不是)。

變分法在理論物理中非常重要:在拉格朗日力學中,以及最小作用量原理在量子力學中的應用。變分法提供了有限元方法的數學基礎,它是求解邊界值問題的強力工具。它們也在材料學中研究材料平衡時大量使用。而在純數學中的例子有,黎曼在調和函數中使用狄利克雷原理。

同樣的材料可以出現在不同的標題中,例如希爾伯特空間技術、莫爾斯理論,或者辛幾何。變分一詞用於所有極值泛函問題。微分幾何中的測地線的研究是很顯然的變分性質的領域。極小曲面(肥皂泡)上也有很多研究工作,稱為普拉托問題。

───

 

用『任意鄰近函式』 \delta x(t) = \epsilon \cdot \eta (t)的概念

Euler–Lagrange equation

Finding the extrema of functionals is similar to finding the maxima and minima of functions. The maxima and minima of a function may be located by finding the points where its derivative vanishes (i.e., is equal to zero). The extrema of functionals may be obtained by finding functions where the functional derivative is equal to zero. This leads to solving the associated Euler–Lagrange equation.[Note 3]

Consider the functional

 J[y] = \int_{x_1}^{x_2} L(x,y(x),y'(x))\, dx \, .

where

x1, x2 are constants,
y (x) is twice continuously differentiable,
y ′(x) = dy / dx  ,
L(x, y (x), y ′(x)) is twice continuously differentiable with respect to its arguments x,  y, and y ′.

If the functional J[y ] attains a local minimum at f , and η(x) is an arbitrary function that has at least one derivative and vanishes at the endpoints x1 and x2 , then for any number ε close to 0,

J[f] \le J[f + \varepsilon \eta] \, .

The term εη is called the variation of the function f and is denoted by δf .[11]

Substituting  f + εη for y  in the functional J[ y ] , the result is a function of ε,

 \Phi(\varepsilon) = J[f+\varepsilon\eta] \, .

Since the functional J[ y ] has a minimum for y = f , the function Φ(ε) has a minimum at ε = 0 and thus,[Note 4]

 \Phi'(0) \equiv \left.\frac{d\Phi}{d\varepsilon}\right|_{\varepsilon = 0} = \int_{x_1}^{x_2} \left.\frac{dL}{d\varepsilon}\right|_{\varepsilon = 0} dx = 0 \, .

Taking the total derivative of L[x, y, y ′] , where y = f + ε η and y ′ = f ′ + ε η ′ are functions of ε but x is not,

 \frac{dL}{d\varepsilon}=\frac{\partial L}{\partial y}\frac{dy}{d\varepsilon} + \frac{\partial L}{\partial y'}\frac{dy'}{d\varepsilon}

and since  dy/dε = η  and  dy ′/dε = η ′ ,

 \frac{dL}{d\varepsilon}=\frac{\partial L}{\partial y}\eta + \frac{\partial L}{\partial y'}\eta' .

Therefore,

 \int_{x_1}^{x_2} \left.\frac{dL}{d\varepsilon}\right|_{\varepsilon = 0} dx = \int_{x_1}^{x_2} \left(\frac{\partial L}{\partial f} \eta + \frac{\partial L}{\partial f'} \eta'\right) dx = \int_{x_1}^{x_2} \left(\frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'} \right) \eta \, dx + \left. \frac{\partial L}{\partial f'} \eta \right|_{x_1}^{x_2} \, ,

where L[x, y, y ′] → L[x, f, f ′] when ε = 0 and we have used integration by parts. The last term vanishes because η = 0 at x1 and x2 by definition. Also, as previously mentioned the left side of the equation is zero so that

 \int_{x_1}^{x_2} \eta \left(\frac{\partial L}{\partial f} - \frac{d}{dx}\frac{\partial L}{\partial f'} \right) \, dx = 0 \, .

According to the fundamental lemma of calculus of variations, the part of the integrand in parentheses is zero, i.e.

 \frac{\partial L}{\partial f} -\frac{d}{dx} \frac{\partial L}{\partial f'}=0

which is called the Euler–Lagrange equation. The left hand side of this equation is called the functional derivative of J[f] and is denoted δJ/δf(x) .

In general this gives a second-order ordinary differential equation which can be solved to obtain the extremal function f(x) . The Euler–Lagrange equation is a necessary, but not sufficient, condition for an extremum J[f]. A sufficient condition for a minimum is given in the section Variations and sufficient condition for a minimum.

───
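舉一個最簡單的例子對照上式(此例為補充說明,非上引條目原文):求平面上連接 (x_1, y_1) 與 (x_2, y_2) 的最短曲線,其泛函為 J[y] = \int_{x_1}^{x_2} \sqrt{1 + y'(x)^2}\, dx。此時 L = \sqrt{1 + y'^2} 與 y 無關,\partial L / \partial y = 0,歐拉-拉格朗日方程化為 \frac{d}{dx}\left( \frac{y'}{\sqrt{1 + y'^2}} \right) = 0,即 y' 為常數,極值曲線正是直線,與直覺相符。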

 

來考察所有『辨識』為某數的『手寫阿拉伯數字』之『相似性』

mnist_test4

 

也可探究因『權重』 weight 之『隨機賦值』或『圖像雜訊』所可能引發的現象與解決之發想乎?

Total variation denoising

In signal processing, Total variation denoising, also known as total variation regularization is a process, most often used in digital image processing, that has applications in noise removal. It is based on the principle that signals with excessive and possibly spurious detail have high total variation, that is, the integral of the absolute gradient of the signal is high. According to this principle, reducing the total variation of the signal subject to it being a close match to the original signal, removes unwanted detail whilst preserving important details such as edges. The concept was pioneered by Rudin et al. in 1992.[1]

This noise removal technique has advantages over simple techniques such as linear smoothing or median filtering which reduce noise but at the same time smooth away edges to a greater or lesser degree. By contrast, total variation denoising is remarkably effective at simultaneously preserving edges whilst smoothing away noise in flat regions, even at low signal-to-noise ratios.[2]

 ROF_Denoising_Example
Example of application of the Rudin et al.[1] total variation denoising technique to an image corrupted by Gaussian noise. This example created using demo_tv.m by Guy Gilboa, see external links.

Mathematical exposition for 1D digital signals

For a digital signal y_n, we can, for example, define the total variation as:

V(y) = \sum\limits_n\left|y_{n+1}-y_n \right|

Given an input signal x_n, the goal of total variation denoising is to find an approximation, call it y_n, that has smaller total variation than x_n but is “close” to x_n. One measure of closeness is the sum of square errors:

E(x,y) = \frac{1}{2}\sum\limits_n\left(x_n - y_n\right)^2

So the total variation denoising problem amounts to minimizing the following discrete functional over the signal y_n:

E(x,y) + \lambda V(y)

By differentiating this functional with respect to y_n, we can derive a corresponding Euler–Lagrange equation, that can be numerically integrated with the original signal x_n as initial condition. This was the original approach.[1] Alternatively, since this is a convex functional, techniques from convex optimization can be used to minimize it and find the solution y_n.[3]

TVD_1D_Example

Application of 1D total variation denoising to a signal obtained from a single-molecule experiment.[3] Gray is the original signal, black is the denoised signal.

Regularization properties

The regularization parameter \lambda plays a critical role in the denoising process. When \lambda=0, there is no denoising and the result is identical to the input signal. As \lambda \to \infty, however, the total variation term plays an increasingly strong role, which forces the result to have smaller total variation, at the expense of being less like the input (noisy) signal. Thus, the choice of regularization parameter is critical to achieving just the right amount of noise removal.

───
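以下是一個假設性的一維全變差去雜訊小草圖(lam、step、eps、iters 皆為任意假設,且以平滑化的 \sqrt{(\cdot)^2 + eps} 代替絕對值以便做梯度下降,並非 Rudin 等人原始的演算法):

# -*- coding: utf-8 -*-
# 假設性草圖:以梯度下降近似極小化 0.5*Σ(x-y)^2 + lam*Σ sqrt((y[n+1]-y[n])^2 + eps)
import numpy as np

def tv_denoise_1d(x, lam=1.0, step=0.01, eps=1e-3, iters=2000):
    y = x.copy()
    for _ in range(iters):
        d = np.diff(y)                    # y[n+1] - y[n]
        w = d / np.sqrt(d * d + eps)      # 平滑化絕對值的導數
        grad_tv = np.zeros_like(y)
        grad_tv[:-1] -= w                 # 對 y[n]   的貢獻
        grad_tv[1:] += w                  # 對 y[n+1] 的貢獻
        y -= step * ((y - x) + lam * grad_tv)
    return y

# 用法示意:階梯訊號加上高斯雜訊,再去雜訊比較誤差
np.random.seed(0)
clean = np.concatenate([np.zeros(100), np.ones(100), 0.3 * np.ones(100)])
noisy = clean + 0.15 * np.random.normal(size=clean.size)
denoised = tv_denoise_1d(noisy)
print("noisy    RMSE: %.4f" % np.sqrt(np.mean((noisy - clean) ** 2)))
print("denoised RMSE: %.4f" % np.sqrt(np.mean((denoised - clean) ** 2)))  # 通常明顯較小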

 

植種大樹,走入森林,方知

縱使宇宙萬有同源,萬象表現實在是錯綜複雜耶!!方了

世間書籍雖然汗牛充棟,原創概念往往卻沒有幾個??


W!o+ 的《小伶鼬工坊演義》︰神經網絡【轉折點】四上

這段文字是本章重點, Michael Nielsen 先生大筆一揮之作。其中的重要性不在於說故事,也不在於談『簡單』與『複雜』到底是誰對誰錯!屬於『信念』的歸『信念』、應該『實證』的得『實證』、有所『懷疑』的當『懷疑』!!如是懷著『科學之心』浸蘊、累積『經驗』久了或能『真知』乎??

Why does regularization help reduce overfitting?

We’ve seen empirically that regularization helps reduce overfitting. That’s encouraging but, unfortunately, it’s not obvious why regularization helps! A standard story people tell to explain what’s going on is along the following lines: smaller weights are, in some sense, lower complexity, and so provide a simpler and more powerful explanation for the data, and should thus be preferred. That’s a pretty terse story, though, and contains several elements that perhaps seem dubious or mystifying. Let’s unpack the story and examine it critically. To do that, let’s suppose we have a simple data set for which we wish to build a model:

Fig-1

Implicitly, we’re studying some real-world phenomenon here, with x and y representing real-world data. Our goal is to build a model which lets us predict y as a function of x. We could try using neural networks to build such a model, but I’m going to do something even simpler: I’ll try to model y as a polynomial in x. I’m doing this instead of using neural nets because using polynomials will make things particularly transparent. Once we’ve understood the polynomial case, we’ll translate to neural networks. Now, there are ten points in the graph above, which means we can find a unique 9th-order polynomial y = a_0 x^9 + a_1 x^8 + \ldots + a_9 which fits the data exactly. Here’s the graph of that polynomial*

*I won’t show the coefficients explicitly, although they are easy to find using a routine such as Numpy’s polyfit. You can view the exact form of the polynomial in the source code for the graph if you’re curious. It’s the function p(x) defined starting on line 14 of the program which produces the graph.:

Fig-2

That provides an exact fit. But we can also get a good fit using the linear model y = 2x:

Fig-3

Which of these is the better model? Which is more likely to be true? And which model is more likely to generalize well to other examples of the same underlying real-world phenomenon?

These are difficult questions. In fact, we can’t determine with certainty the answer to any of the above questions, without much more information about the underlying real-world phenomenon. But let’s consider two possibilities: (1) the 9th order polynomial is, in fact, the model which truly describes the real-world phenomenon, and the model will therefore generalize perfectly; (2) the correct model is y = 2x, but there’s a little additional noise due to, say, measurement error, and that’s why the model isn’t an exact fit.

It’s not a priori possible to say which of these two possibilities is correct. (Or, indeed, if some third possibility holds). Logically, either could be true. And it’s not a trivial difference. It’s true that on the data provided there’s only a small difference between the two models. But suppose we want to predict the value of y corresponding to some large value of x, much larger than any shown on the graph above. If we try to do that there will be a dramatic difference between the predictions of the two models, as the 9th order polynomial model comes to be dominated by the x^9 term, while the linear model remains, well, linear.
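這段推論可以用 Nielsen 先生註腳中提到的 numpy polyfit 做個假設性的小實驗(下列資料點是隨手假造的 y ≈ 2x 加雜訊,並非原圖資料):同一批十個點分別以九次多項式與一次直線擬合,再往圖外較大的 x 外插,即可看到兩個模型的預測急遽分歧:

# -*- coding: utf-8 -*-
# 假設性示意:十個點分別用 9 次多項式與 1 次直線擬合,再外插比較
import numpy as np

np.random.seed(1)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 0.1 * np.random.normal(size=x.size)   # 假造資料:y ≈ 2x + 雜訊

p9 = np.polyfit(x, y, 9)      # 9 次多項式:十個點可被恰好(過度)擬合
p1 = np.polyfit(x, y, 1)      # 1 次直線:近似 y = 2x

for x_new in (1.5, 3.0, 10.0):                      # 圖外的較大 x
    print("x=%5.1f   9th: %14.2f   linear: %8.2f" %
          (x_new, np.polyval(p9, x_new), np.polyval(p1, x_new)))
# 九次模型逐漸被 x^9 項主宰而失控;直線模型則一直維持約 2x。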

One point of view is to say that in science we should go with the simpler explanation, unless compelled not to. When we find a simple model that seems to explain many data points we are tempted to shout “Eureka!” After all, it seems unlikely that a simple explanation should occur merely by coincidence. Rather, we suspect that the model must be expressing some underlying truth about the phenomenon. In the case at hand, the model y = 2x+{\rm noise} seems much simpler than y = a_0 x^9 + a_1 x^8 + \ldots. It would be surprising if that simplicity had occurred by chance, and so we suspect that y = 2x+{\rm noise} expresses some underlying truth. In this point of view, the 9th order model is really just learning the effects of local noise. And so while the 9th order model works perfectly for these particular data points, the model will fail to generalize to other data points, and the noisy linear model will have greater predictive power.

Let’s see what this point of view means for neural networks. Suppose our network mostly has small weights, as will tend to happen in a regularized network. The smallness of the weights means that the behaviour of the network won’t change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Think of it as a way of making it so single pieces of evidence don’t matter too much to the output of the network. Instead, a regularized network learns to respond to types of evidence which are seen often across the training set. By contrast, a network with large weights may change its behaviour quite a bit in response to small changes in the input. And so an unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data. In a nutshell, regularized networks are constrained to build relatively simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data. The hope is that this will force our networks to do real learning about the phenomenon at hand, and to generalize better from what they learn.

With that said, this idea of preferring simpler explanation should make you nervous. People sometimes refer to this idea as “Occam’s Razor“, and will zealously apply it as though it has the status of some general scientific principle. But, of course, it’s not a general scientific principle. There is no a priori logical reason to prefer simple explanations over more complex explanations. Indeed, sometimes the more complex explanation turns out to be correct.

Let me describe two examples where more complex explanations have turned out to be correct.

……

There are three morals to draw from these stories. First, it can be quite a subtle business deciding which of two explanations is truly “simpler”. Second, even if we can make such a judgment, simplicity is a guide that must be used with great caution! Third, the true test of a model is not simplicity, but rather how well it does in predicting new phenomena, in new regimes of behaviour.

With that said, and keeping the need for caution in mind, it’s an empirical fact that regularized neural networks usually generalize better than unregularized networks. And so through the remainder of the book we will make frequent use of regularization. I’ve included the stories above merely to help convey why no-one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize. Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse. And so you can view regularization as something of a kludge. While it often helps, we don’t have an entirely satisfactory systematic understanding of what’s going on, merely incomplete heuristics and rules of thumb.

There’s a deeper set of issues here, issues which go to the heart of science. It’s the question of how we generalize. Regularization may give us a computational magic wand that helps our networks generalize better, but it doesn’t give us a principled understanding of how generalization works, nor of what the best approach is*

*These issues go back to the problem of induction, famously discussed by the Scottish philosopher David Hume in “An Enquiry Concerning Human Understanding” (1748). The problem of induction has been given a modern machine learning form in the no-free lunch theorem (link) of David Wolpert and William Macready (1997)..

……

Let me conclude this section by returning to a detail which I left unexplained earlier: the fact that L2 regularization doesn’t constrain the biases. Of course, it would be easy to modify the regularization procedure to regularize the biases. Empirically, doing this often doesn’t change the results very much, so to some extent it’s merely a convention whether to regularize the biases or not. However, it’s worth noting that having a large bias doesn’t make a neuron sensitive to its inputs in the same way as having large weights. And so we don’t need to worry about large biases enabling our network to learn the noise in our training data. At the same time, allowing large biases gives our networks more flexibility in behaviour – in particular, large biases make it easier for neurons to saturate, which is sometimes desirable. For these reasons we don’t usually include bias terms when regularizing.

───

 

事實上  Michael Nielsen 先生字裡行間透露出過去以來的科學傳統,來自偉大科學家如何進行『科學活動』之『作為典範』。畢竟科學絕非僅止『經驗公式』而已,『理論』之建立,『假說』之設置是為著『理解自然』,故而即使已知許多『現象』是『非線性』的,反倒是先將『線性系統』給研究個徹底。如是方能對比『非線性』系統真實何謂耶??!!

何謂『線性系統』? 假使從『系統論』的觀點來看,一個物理系統 S,如果它的『輸入輸出』或者講『刺激響應』滿足

設使 I_m(\cdots, t) \Rightarrow_{S} O_m(\cdots, t) 且 I_n(\cdots, t) \Rightarrow_{S} O_n(\cdots, t),

那麼 \alpha \cdot I_m(\cdots, t) + \beta \cdot I_n(\cdots, t) \Rightarrow_{S} \alpha \cdot O_m(\cdots, t) + \beta \cdot O_n(\cdots, t)

也就是說一個線性系統︰無因就無果、小因得小果、大因得大果,眾因所得果為各因之果之總計。

如果一個線性系統還滿足

\left[I_m(\cdots, t) \Rightarrow_{S} O_m(\cdots, t)\right] \Rightarrow \left[I_m(\cdots, t + \tau) \Rightarrow_{S} O_m(\cdots, t + \tau)\right]

,這個系統稱作『線性非時變系統』。系統中的『因果關係』是『恆常的』,不隨著時間變化,因此『遲延之因』生『遲延之果』。線性非時變 LTI (Linear Time-Invariant) 系統理論之基本結論是

任何 LTI 系統都可以完全祇用一個單一方程式來表示,稱之為系統的『衝激響應』。系統的輸出可以簡單表示為輸入信號與系統的『衝激響應』的『卷積』Convolution 。
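下面是一個假設性的離散時間小例子(脈衝響應 h 與輸入訊號皆為隨意取的):只要知道系統的脈衝響應,任何輸入的輸出都可由卷積求得,同時順手檢驗疊加原理與非時變性:

# -*- coding: utf-8 -*-
# 假設性示意:離散 LTI 系統 = 與脈衝響應做卷積;並檢驗疊加原理、非時變性
import numpy as np

h = np.array([0.5, 0.3, 0.2])            # 假設的脈衝響應(簡單的平滑器)
x1 = np.random.randn(16)
x2 = np.random.randn(16)
a, b = 2.0, -1.5

def system(x):
    # LTI 系統:輸出即輸入與 h 的卷積
    return np.convolve(x, h)

# 疊加原理:S(a*x1 + b*x2) 等於 a*S(x1) + b*S(x2)
lhs = system(a * x1 + b * x2)
rhs = a * system(x1) + b * system(x2)
print("superposition holds: %s" % np.allclose(lhs, rhs))   # True

# 非時變:輸入延遲 3 步,輸出也只是延遲 3 步
x1_delayed = np.concatenate([np.zeros(3), x1])
print("time-invariant: %s" % np.allclose(system(x1_delayed)[3:], system(x1)))  # True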

300px-Tangent-calculus.svg

220px-Anas_platyrhynchos_with_ducklings_reflecting_water

350px-PhaseConjugationPrinciple.en.svg

雖然很多的『基礎現象』之『物理模型』可以用  LTI 系統來描述。即使已經知道一個系統是『非線性』的,將它在尚未解出之『所稱解』── 比方說『熱力平衡』時 ── 附近作系統的『線性化』處理,以了解這個系統在『那時那裡』的行為,卻是常有之事。

科技理論上偏好『線性系統』 ,並非只是為了『數學求解』的容易性,尤其是在現今所謂的『雲端計算』時代,祇是一般『數值解答』通常不能提供『深入理解』那個『物理現象』背後的『因果機制』的原由,所以用著『線性化』來『解析』系統『局部行為』,大概也是『不得不』的吧!就像『混沌現象』與『巨變理論』述說著『自然之大,無奇不有』,要如何『詮釋現象』難道會是『不可說』的嗎??

一般物理上所謂的『疊加原理』 Superposition Principle 就是說該系統是一個線性系統。物理上還有一個『局部原理』Principle of Locality 是講︰一個物體的『運動』與『變化』,只會受到它『所在位置』的『周遭影響』。所以此原理排斥『超距作用』,因此『萬有引力』為『廣義相對論』所取代;且電磁學的『馬克士威方程式』取消了『庫倫作用力』。這也就是許多物理學家很在意『量子糾纏』的原因!俗語說『好事不出門, 壞事傳千里』是否是違背了『局部原理』的呢??

蘇格蘭的哲學家大衛‧休謨 David Hume 經驗論大師,一位徹底的懷疑主義者,反對『因果原理』Causality,認為因果不過是一種『心理感覺』。好比奧地利‧捷克物理學家恩斯特‧馬赫 Ernst Mach  在《Die Mechanik in ihrer Entwicklung, Historisch-kritisch dargestellt》一書中講根本不需要『萬有引力』 之『』與『』 ,直接說任何具有質量的兩物間,會有滿足

m_1 \frac{d^2 {\mathbf r}_1 }{ dt^2} = -\frac{m_1 m_2 g ({\mathbf r}_1 - {\mathbf r}_2)}{ |{\mathbf r}_1 - {\mathbf r}_2|^3};\; m_2 \frac{d^2 {\mathbf r}_2 }{dt^2} = -\frac{m_1 m_2 g ({\mathbf r}_2 - {\mathbf r}_1) }{ |{\mathbf r}_2 - {\mathbf r}_1|^3}

方程組的就好了;他進一步講牛頓所說的『』根本是『贅語』,那不過只是物質間的一種『交互作用』interaction 罷了!當真是『緣起性空。萬法歸一,一歸於宗。』的嗎??

─── 摘自《【Sonic π】聲波之傳播原理︰原理篇《四中》