W!o+'s 《小伶鼬工坊演義》: Neural Networks 【學而堯曰】 Four

What does 'a diameter of one, a circumference of three' speak of? Heaven is round ○, Earth is square □, and the Way stands plain. The 'one' is the circle's diameter, and the 'three' is none other than

Pi

Pi is defined as the ratio of a circle's circumference to its diameter. Usually written π, it is a mathematical constant found throughout mathematics and physics, and it is the key value for the exact computation of geometric quantities such as circumference, circle area, and sphere volume. \pi also equals the ratio of a circle's area to the square of its radius.

In analysis, \pi can be rigorously defined as the smallest positive real number x satisfying \sin(x)=0, where \sin is the sine function (taken in its analytic definition).
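To make that analytic definition concrete, here is a minimal numerical sketch (plain Python, standard library only) that locates the smallest positive root of \sin(x)=0 by bisection and compares it with math.pi; taking [2, 4] as the bracketing interval is simply an assumption that the first positive root lies there.

import math

# Bisection on [2, 4], where sin changes sign exactly once; the smallest
# positive root of sin(x) = 0 is pi itself.
lo, hi = 2.0, 4.0
for _ in range(60):                      # 60 halvings narrow the interval below 1e-17
    mid = (lo + hi) / 2
    if math.sin(lo) * math.sin(mid) <= 0:
        hi = mid
    else:
        lo = mid
print((lo + hi) / 2, math.pi)            # both print 3.141592653589793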

 

'Astronomical calendars' must be exact; 'the Sun, the Moon, and the five planets' meet at their appointed times; the rise of the 'circle-cutting method' was therefore inevitable; and 'Euler' delivered his grand, high-spirited argument,

The 'Basel problem' is a famous problem in 'number theory', first posed by 'Pietro Mengoli' in 1644. Because it had defeated so many earlier mathematicians, 'Euler' became famous the moment he solved it in 1735, at the age of twenty-eight. He went on to generalize the problem, and his ideas were later taken up by 'Riemann' in the 1859 paper On the Number of Primes Less Than a Given Magnitude, which defined the 'Riemann ζ function' and proved some of its basic properties. Why, then, is it called the 'Basel problem' today? Because this 'Basel' was precisely the 'hometown' of 'Euler' and of the 'Bernoulli' family. So can a 'sum of a series' such as \sum \limits_{n=1}^\infty \frac{1}{n^2} = \lim \limits_{n \to +\infty}\left(\frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{n^2}\right) really carry any 'importance'? Judging only by the 'history' of the 'summability' of 'divergent series', it might have taken another hundred years, perhaps only to be 'argued over again' after 'Cauchy's' 'notion of limit' had conquered the field!! All the more reason, then, to look at 'Euler's' own 'argument' as 'history' records it!!

[Figure: the Basel problem]
\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}

[Figure: Euler on the Swiss 10-franc banknote (front)]

[Figure: Euler on a GDR stamp]

[Figure: Euler on a 1957 USSR stamp]

[Figure: an Euler diagram, from logic]

Suppose the 'trigonometric function' \sin{x} can be written as \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots. 'Dividing by x' then gives \frac{\sin(x)}{x} = 1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots. Now the 'roots' of \sin{x} are x = n\cdot\pi, and because of the 'division by x' we must have n \neq 0, so n = \pm1, \pm2, \pm3, \dots. Hence \frac{\sin(x)}{x} should 'equal' \left(1 - \frac{x}{\pi}\right)\left(1 + \frac{x}{\pi}\right)\left(1 - \frac{x}{2\pi}\right)\left(1 + \frac{x}{2\pi}\right)\left(1 - \frac{x}{3\pi}\right)\left(1 + \frac{x}{3\pi}\right) \cdots, which is to say it 'equals' \left(1 - \frac{x^2}{\pi^2}\right)\left(1 - \frac{x^2}{4\pi^2}\right)\left(1 - \frac{x^2}{9\pi^2}\right) \cdots. If, following 'Newton's identities', we look at the 'coefficient' of the x^2 term, we get - \left(\frac{1}{\pi^2} + \frac{1}{4\pi^2} + \frac{1}{9\pi^2} + \cdots \right) = -\frac{1}{\pi^2}\sum_{n=1}^{\infty}\frac{1}{n^2}. But the 'coefficient' of 'x^2' in \frac{\sin(x)}{x} is '- \frac{1}{3!} = -\frac{1}{6}', so -\frac{1}{6} = -\frac{1}{\pi^2}\sum \limits_{n=1}^{\infty}\frac{1}{n^2}, and therefore \sum \limits_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}. So, was 'Euler' 'right'? Or was he in fact 'wrong'??

─── Excerpted from 《【Sonic π】電聲學之電路學《四》之《 V!》‧下》
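Whether 'right' or 'wrong', Euler's conclusion is easy to test numerically. A quick sketch (plain Python) compares partial sums of \sum 1/n^2 with \pi^2/6; the printed gaps shrink roughly like 1/N.

import math

target = math.pi ** 2 / 6                  # Euler's value for the Basel sum
for N in (10, 1_000, 100_000):
    partial = sum(1 / n ** 2 for n in range(1, N + 1))
    print(N, partial, target - partial)    # the remaining gap is roughly 1/N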

 

The name of \pi rings out across the heavens.

Adopting π as the symbol

As far as is currently known, the earliest use of the Greek letter π for the circle ratio appears in the 1706 work Synopsis Palmariorum Matheseos; or, a New Introduction to the Mathematics by the Welsh mathematician William Jones.[24] The letter π first appears there in the phrase "1/2 Periphery (π)", in a discussion of a circle of radius 1.[25] He may have chosen π because it is the first letter of περιφέρεια, the Greek word corresponding to periphery.

Other mathematicians did not immediately follow suit, however, sometimes using letters such as c or p for the circle ratio.[26] It was the mathematician Euler who spread this use of π. He began using π in his 1736 work Mechanica, and because Euler corresponded widely with other European mathematicians, the notation spread rapidly.[26] In 1748 Euler used π in his widely read classic Introductio in analysin infinitorum, writing: "for the sake of brevity we will write this number as π; thus π is equal to half the circumference of a circle of radius 1, in other words π is the length of an arc of 180 degrees." From then on π was generally accepted throughout the Western world.[26][27]

 

Although e^{i \pi} + 1 = 0 is still fresh before our eyes, in the blink of an eye the world has begun to raise criticisms.

Criticism

In recent years, some scholars have argued that π ≈ 3.14 is "unnatural" and should be replaced by a constant twice as large, roughly 6.28. Supporters of this view note that 2π appears very frequently in mathematical formulas, while a lone π rarely does. The American physicist and educator Michael Hartl says that "circles are not about diameter but about radius: a circle is the set of points at a fixed distance, the radius, from a center", and he proposes using the Greek letter τ in place of π.[28][29][30]

In 2001 the American mathematician Bob Palais published a paper titled "π Is Wrong!" in The Mathematical Intelligencer. In its opening paragraph Palais writes:

For centuries π has enjoyed boundless admiration and praise. Mathematicians sing of its greatness and mystery and treat it as an emblem of the mathematical world; calculators and programming languages cannot do without it; there is even a film named directly after it... But π is really an impostor; the number truly deserving of our awe and admiration is the one we have had the misfortune to call 2π.

Michael Hartl set up the website tauday.com, calling on people to use the Greek letter τ (pronounced "tau") for the "correct" circle constant C/r, and suggesting that future papers simply open with "For convenience, let τ = 2π."

The well-known geek webcomic site spikedmath.com responded by creating thepimanifesto.com, which hosts a "Pi Manifesto" running to several thousand words. It rebuts the pro-τ arguments, holding that defining the circle constant as the ratio of circumference to diameter has advantages of its own, for example that when gauging the cross-section of a cylindrical object the diameter is easier to measure than the radius.

 

A river of a thousand years, waters of ten thousand, have brewed 'a jug of wine at a temple on the mountain top' (山巔一寺一壺酒)!

Culture

Recitation

The world record stands at 100,000 digits: on 3 October 2006, Akira Haraguchi of Japan recited π to 100,000 decimal places.[31]

In Mandarin, one homophonic mnemonic runs 「山巔一寺一壺酒,爾樂苦煞吾,把酒吃,酒殺爾,殺不死,樂而樂」, whose syllables sound out 3.1415926535897932384626. Another runs 「山巔一石一壺酒,二妞舞扇舞,把酒沏酒搧又搧,飽死囉」, giving 3.14159265358979323846.

In English, word lengths are used as digits; for example, "How I want a drink, alcoholic of course, after the heavy lectures involving quantum mechanics. All of the geometry, Herr Planck, is fairly hard, and if the lectures were boring or tiring, then any odd thinking was on quartic equations again." stands for 3.1415926535897932384626433832795.

───
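The English word-length trick above is mechanical enough to automate. A small sketch (plain Python) encodes just the first sentence of that mnemonic, whose word lengths spell out 3.14159265358979:

sentence = ("How I want a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")
digits = [len(word.strip(",.")) for word in sentence.split()]   # letters per word
encoded = f"{digits[0]}." + "".join(str(d) for d in digits[1:])
print(encoded)                                                  # 3.14159265358979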

 

If one asks what \pi really is?? It is a banner of the history of 'STEM', science, technology, engineering, and mathematics!! Seen this way, is 'Pi' (派, a pie) truly a 'way of nourishing life'??!!

Perhaps Mr. Michael Nielsen excels at 'nourishing the mind' and is good at 'inspiring' learners, which is why his 'exercises' outweigh the 'main text'; presumably he hopes readers will 'put hands and brains to work':

Let’s return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we’ll begin with the case where the quadratic cost did just fine, with starting weight 0.6 and starting bias 0.9. Press “Run” to see what happens when we replace the quadratic cost by the cross-entropy:

Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let’s look at the case where our neuron got stuck before (link, for comparison), with the weight and bias both starting at 2.0:

Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It’s that steepness which the cross-entropy buys us, preventing us from getting stuck just when we’d expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

I didn’t say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used \eta = 0.15. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it’s not possible to say precisely what it means to use the “same” learning rate; it’s an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you’re still curious, despite my disavowal, here’s the lowdown: I used \eta = 0.005 in the examples just given.

You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn’t about the absolute speed of learning. It’s about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don’t depend on how the learning rate is set.
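Since the interactive "Run" demos are not reproduced here, a minimal sketch of those two runs may help. It assumes, as in the book's earlier toy example, a single training input x = 1 with desired output y = 0, and uses the learning rates quoted above (\eta = 0.15 for the quadratic cost, \eta = 0.005 for the cross-entropy) from the "stuck" start w = b = 2.0:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run(cost, eta, w=2.0, b=2.0, x=1.0, y=0.0, epochs=300):
    costs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        if cost == "quadratic":
            costs.append(0.5 * (a - y) ** 2)
            grad = (a - y) * a * (1 - a)       # carries the sigma'(z) factor
        else:                                  # cross-entropy
            costs.append(-(y * math.log(a) + (1 - y) * math.log(1 - a)))
            grad = a - y                       # the sigma'(z) factor cancels
        w -= eta * grad * x
        b -= eta * grad
    return costs

quad = run("quadratic", eta=0.15)
xent = run("cross-entropy", eta=0.005)
# The cross-entropy curve has its steepest slope at the start, while the
# quadratic-cost curve is nearly flat for the first epochs.
print(quad[0], quad[-1], xent[0], xent[-1])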

We’ve been studying the cross-entropy for a single neuron. However, it’s easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose y = y_1, y_2, \ldots are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, \ldots are the actual output values. Then we define the cross-entropy by

C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \ \ \ \ (63)

This is the same as our earlier expression, Equation (57), except now we’ve got the \sum_j summing over all the output neurons. I won’t explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you’re interested, you can work through the derivation in the problem below.
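As a concrete reading of Equation (63), here is a short sketch (assuming NumPy) that evaluates the many-neuron cross-entropy for a batch of training examples, averaging over the examples and summing over the output neurons:

import numpy as np

def cross_entropy(a_L, y):
    """Equation (63): a_L and y have shape (n_examples, n_output_neurons)."""
    n = a_L.shape[0]
    return -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L)) / n

# Toy values: the cost is small when the output activations sit near the targets.
y   = np.array([[1.0, 0.0], [0.0, 1.0]])
a_L = np.array([[0.95, 0.05], [0.10, 0.90]])
print(cross_entropy(a_L, y))             # about 0.16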

When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we’re setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input – that is, an output neuron will have saturated near 1, when it should be 0, or vice versa. If we’re using the quadratic cost that will slow down learning. It won’t stop learning completely, since the weights will continue learning from other training inputs, but it’s obviously undesirable.
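The "decisively wrong" case is easy to quantify. A small sketch (plain Python) shows the single-example output gradients for a sigmoid neuron with target y = 0: the quadratic-cost gradient carries a factor \sigma'(z) and collapses as the neuron saturates near 1, while the cross-entropy gradient does not.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = 0.0
for z in (0.5, 2.0, 5.0, 10.0):           # increasingly saturated near 1
    a = sigmoid(z)
    quad = (a - y) * a * (1 - a)          # (a - y) * sigma'(z)
    xent = a - y                          # sigma'(z) has cancelled
    print(f"z = {z:4.1f}   quadratic: {quad:.6f}   cross-entropy: {xent:.6f}")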

Exercises

  • One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the ys and the as. It’s easy to get confused about whether the right form is -[y \ln a + (1-y) \ln (1-a)] or -[a \ln y + (1-a) \ln (1-y)]. What happens to the second of these expressions when y = 0 or 1? Does this problem afflict the first expression? Why or why not?
  • In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if \sigma(z) \approx y for all training inputs. The argument relied on y being equal to either 0 or 1. This is usually true in classification problems, but for other problems (e.g., regression problems) y can sometimes take values intermediate between 0 and 1. Show that the cross-entropy is still minimized when \sigma(z) = y for all training inputs. When this is the case the cross-entropy has the value:
    C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)]. \ \ \ \ (64)

    The quantity -[y \ln y+(1-y)\ln(1-y)] is sometimes known as the binary entropy.
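For a numerical feel for that second exercise before proving it, a small sketch (plain Python) scans a for a fixed target y between 0 and 1; the per-example cross-entropy bottoms out at a = y, where it equals the binary entropy:

import math

def xent(y, a):
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

y = 0.3                                          # an intermediate target value
a_grid = [i / 1000 for i in range(1, 1000)]
a_best = min(a_grid, key=lambda a: xent(y, a))
binary_entropy = -(y * math.log(y) + (1 - y) * math.log(1 - y))
print(a_best, xent(y, a_best), binary_entropy)   # minimum at a = 0.3, value ~0.611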

Problems

  • Many-layer multi-neuron networks In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is
    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \sigma'(z^L_j). \ \ \ \ (65)

    The term \sigma'(z^L_j) causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error \delta^L for a single training example x is given by

    \delta^L = a^L-y. \ \ \ \ (66)

    Use this expression to show that the partial derivative with respect to the weights in the output layer is given by

    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j). \ \ \ \ (67)

    The \sigma'(z^L_j) term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.

  • Using the quadratic cost when we have linear neurons in the output layer Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function then the output error \delta^L for a single training example x is given by
    \delta^L = a^L-y. \ \ \ \ (68)

    Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by

    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \ \ \ \ (69)

    \frac{\partial C}{\partial b^L_{j}} = \frac{1}{n} \sum_x (a^L_j-y_j). \ \ \ \ (70)

     

    This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.
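To see the problems above in action without writing out the algebra, here is a finite-difference sketch (assuming NumPy; the layer sizes and values are made up) that checks Equation (67) for a single training example: the analytic gradient a^{L-1}_k (a^L_j - y_j) of the cross-entropy cost with a sigmoid output layer matches a numerical derivative of the cost.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, a_prev, y):
    a_L = sigmoid(w @ a_prev + b)                      # sigmoid output layer
    return -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))

rng = np.random.default_rng(0)
a_prev = rng.random(4)                                 # activations a^{L-1}_k
w = rng.normal(size=(3, 4))                            # weights w^L_{jk}
b = rng.normal(size=3)                                 # biases b^L_j
y = np.array([1.0, 0.0, 1.0])                          # desired outputs y_j

a_L = sigmoid(w @ a_prev + b)
analytic = np.outer(a_L - y, a_prev)                   # (a^L_j - y_j) a^{L-1}_k

eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[j, k] += eps
        wm[j, k] -= eps
        numeric[j, k] = (cost(wp, b, a_prev, y) - cost(wm, b, a_prev, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))              # tiny: the gradients agree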

───

 

For this reason, there is no need to say much more here; may each of us 'build our dreams on solid ground'.

Chapter 42 of the Laozi says: 'The Way gives birth to one, one gives birth to two, two gives birth to three, and three gives birth to the ten thousand things.' It means that Heaven and Earth bring forth the ten thousand things as naturally as the four seasons turn. If 'man' is ever to count as 'great as well', he must understand great Nature's Way and, by following it back to the root, be able to 'obtain the One'. Laozi was certainly skilled at 'watching water' and praised how 'the highest good is like water', yet he also knew well the plight of water dammed up by mountains and of people blocked by their own desires, and so in chapter 39 he says further:

Of old, these obtained the One: Heaven obtained the One and became clear; Earth obtained the One and became tranquil; the spirits obtained the One and became numinous; the valleys obtained the One and became full; the ten thousand things obtained the One and came alive; lords and kings obtained the One and became the standard of all under Heaven. Carried to the extreme: without what makes it clear, Heaven might well split; without what makes it tranquil, Earth might well quake; without what makes them numinous, the spirits might well fade; without what makes them full, the valleys might well run dry; without what makes them live, the ten thousand things might well perish; without what keeps them honored and high, lords and kings might well fall. Hence the noble takes the humble as its root, and the high takes the low as its foundation. That is why lords and kings call themselves 'the orphaned', 'the widowed', 'the unprovisioned'. Is this not taking the humble as the root? Is it not? What people most dislike is to be orphaned, widowed, unprovisioned, yet lords and kings take these as their titles. Hence the utmost praise is without praise. Do not wish to tinkle like jade; rather clatter like stone.

, hoping people will realize that the name 'morality' (道德, dao-de) really speaks of 'attaining' (得到, de-dao), of attaining the Way!! And if Heaven and Earth themselves have 'no road' left to walk, in which 'direction' should man then head??

─── Excerpted from 《跟隨□?築夢!!》