W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】六

《三十六計》南北朝‧檀道濟

瞞天過海

備周則意怠;常見則不疑。陰在陽之內,不在陰之外。太陽,太陰 。

唐太宗貞觀十七年,太宗領軍三十萬東征,太宗會暈船,薛仁貴怕皇上不敢過海而退兵,故假扮為一豪民,拜見唐太宗,邀請太宗文武百官到他家作客,豪民家飾以繡幔彩錦,環繞於室,好不漂亮,太宗與百官遂於豪民家飲酒作樂。不久,房室搖晃,杯酒落地,太宗等人驚嚇,揭開繡幔彩錦,發現他與三十萬大軍已在海上。古時皇帝自稱天子,故瞞「天」過海的天,指的是皇帝,此計遂稱為瞞天過海。

 

兵法講究『陰陽』,伺候打探『消息』,事件給予『情報』。常見則『發生頻率』高,因太普通故不生疑,認為少有『資訊價值』也 !若說有人能從此處建立『資訊理論』,當真是『資訊 bit 』比特值極高的乎?

克勞德·夏農

克勞德·艾爾伍德·夏農(Claude Elwood Shannon,1916年4月30日-2001年2月26日),美國數學家、電子工程師、密碼學家,被譽為資訊理論的創始人。[1][2]夏農是密西根大學學士,麻省理工學院博士。

1948年,夏農發表了劃時代的論文——《通訊的數學原理》,奠定了現代資訊理論的基礎。不僅如此,夏農還被認為是數位計算機理論和數位電路設計理論的創始人。1937年,21歲的夏農是麻省理工學院的碩士研究生,他在其碩士論文中提出,將布爾代數應用於電子領域,能夠構建並解決任何邏輯和數值關係,被譽為有史以來最具水平的碩士論文之一[3]。二戰期間,夏農為軍事領域的密碼分析——密碼破譯和保密通訊——做出了很大貢獻。

───

 

無奈這把『資訊尺』用了許多丈二金剛摸不著頭腦的『術語』───傳輸器、通道、接收器、雜訊源、熵、期望值、機率、資訊內容… ,維基百科詞條讀來宛若『天書』耶??

Entropy (information theory)

In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. ‘Messages’ can be modeled by any flow of information.

In a more technical sense, there are reasons (explained below) to define information as the negative of the logarithm of the probability distribution. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit.

The logarithm of the probability distribution is useful as a measure of entropy because it is additive for independent sources. For instance, the entropy of a coin toss is 1 shannon, whereas of m tosses it is m shannons. Generally, you need log2(n) bits to represent a variable that can take one of n values if n is a power of 2. If these values are equally probable, the entropy (in shannons) is equal to the number of bits. Equality between number of bits and shannons holds only while all outcomes are equally probable. If one of the events is more probable than others, observation of that event is less informative. Conversely, rarer events provide more information when observed. Since observation of less probable events occurs more rarely, the net effect is that the entropy (thought of as average information) received from non-uniformly distributed data is less than log2(n). Entropy is zero when one outcome is certain. Shannon entropy quantifies all these considerations exactly when a probability distribution of the source is known. The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.

Generally, entropy refers to disorder or uncertainty. Shannon entropy was introduced by Claude E. Shannon in his 1948 paper “A Mathematical Theory of Communication“.[1] Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of an information source. Rényi entropy generalizes Shannon entropy.

Definition

Named after Boltzmann’s Η-theorem, Shannon defined the entropy Η (Greek letter Eta) of a discrete random variable X with possible values {x1, …, xn} and probability mass function P(X) as:

\Eta(X) = \mathrm{E}[\mathrm{I}(X)] = \mathrm{E}[-\ln(\mathrm{P}(X))].

Here E is the expected value operator, and I is the information content of X.[4][5] I(X) is itself a random variable.

The entropy can explicitly be written as

\Eta(X) = \sum_{i=1}^n {\mathrm{P}(x_i)\,\mathrm{I}(x_i)} = -\sum_{i=1}^n {\mathrm{P}(x_i) \log_b \mathrm{P}(x_i)},

where b is the base of the logarithm used. Common values of b are 2, Euler’s number e, and 10, and the unit of entropy is shannon for b = 2, nat for b = e, and hartley for b = 10.[6] When b = 2, the units of entropy are also commonly referred to as bits.

In the case of p(xi) = 0 for some i, the value of the corresponding summand 0 logb(0) is taken to be 0, which is consistent with the limit:

\lim_{p\to0+}p\log (p) = 0.

When the distribution is continuous rather than discrete, the sum is replaced with an integral as

\Eta(X) = \int {\mathrm{P}(x)\,\mathrm{I}(x)} ~dx = -\int {\mathrm{P}(x) \log_b \mathrm{P}(x)} ~dx,

where P(x) represents a probability density function.

One may also define the conditional entropy of two events X and Y taking values xi and yj respectively, as

 \Eta(X|Y)=\sum_{i,j}p(x_{i},y_{j})\log\frac{p(y_{j})}{p(x_{i},y_{j})}

where p(xi, yj) is the probability that X = xi and Y = yj. This quantity should be understood as the amount of randomness in the random variable X given the event Y.

───
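若照上面定義動手一算,『熵』便不再抽象。以下是一段極簡的 Python 示意(函式名稱 shannon_entropy 純屬舉例,約定 0 \log 0 = 0):

```python
from math import log

def shannon_entropy(probs, base=2):
    """計算離散分布之熵 H = -Σ p_i · log_b(p_i),約定 0·log(0) = 0。
    base=2 得 shannon(bit),base=e 得 nat,base=10 得 hartley。"""
    return -sum(p * log(p, base) for p in probs if p > 0)

# 公平硬幣:H = 1 bit;偏倚硬幣:H < 1 bit
print(shannon_entropy([0.5, 0.5]))        # 1.0
print(shannon_entropy([0.9, 0.1]))        # ≈ 0.469
# 四個等機率結果:H = 2 bits,恰為 log2(4)
print(shannon_entropy([0.25] * 4))        # 2.0
```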

 

如果嘗試親讀夏農一九四八年的大著

A Mathematical Theory of Communication

Shannon

 

或可解惑矣。其中最大要點就在清楚明白『資訊函式』到底是什麼呢???

Rationale

To understand the meaning of pi log(1/pi), at first, try to define an information function, I, in terms of an event i with probability pi. How much information is acquired due to the observation of event i? Shannon’s solution follows from the fundamental properties of information:[7]

  1. I(p) ≥ 0 – information is a non-negative quantity
  2. I(1) = 0 – events that always occur do not communicate information
  3. I(p1 p2) = I(p1) + I(p2) – information due to independent events is additive

The last is a crucial property. It states that joint probability communicates as much information as two individual events separately. Particularly, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes then there are mn possible outcomes of the joint event. This means that if log2(n) bits are needed to encode the first value and log2(m) to encode the second, one needs log2(mn) = log2(m) + log2(n) to encode both. Shannon discovered that the proper choice of function to quantify information, preserving this additivity, is logarithmic, i.e.,

\mathrm{I}(p) = \log(1/p)

The base of the logarithm can be any fixed real number greater than 1. The different units of information (bits for log2, trits for log3, nats for the natural logarithm ln and so on) are just constant multiples of each other. (In contrast, the entropy would be negative if the base of the logarithm were less than 1.) For instance, in case of a fair coin toss, heads provides log2(2) = 1 bit of information, which is approximately 0.693 nats or 0.631 trits. Because of additivity, n tosses provide n bits of information, which is approximately 0.693n nats or 0.631n trits.

Now, suppose we have a distribution where event i can happen with probability pi. Suppose we have sampled it N times and outcome i was, accordingly, seen ni = N pi times. The total amount of information we have received is

\sum_i {n_i \mathrm{I}(p_i)} = \sum {N p_i \log(1/p_i)}.

The average amount of information that we receive with every event is therefore

\sum_i {p_i \log {1\over p_i}}.

───
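夏農所列之『資訊函式』三條性質,也可用幾行 Python 驗算(以下為補充之示意,非夏農原文所附):

```python
from math import log, isclose

def info(p, base=2):
    """事件機率 p 之資訊量 I(p) = log_b(1/p)。"""
    return log(1.0 / p, base)

# 性質:I(1) = 0;I(p1*p2) = I(p1) + I(p2)(獨立事件之可加性)
assert isclose(info(1.0), 0.0)
assert isclose(info(0.5 * 0.25), info(0.5) + info(0.25))

# 平均資訊量 Σ p_i · I(p_i) 即是熵
probs = [0.5, 0.25, 0.125, 0.125]
print(sum(p * info(p) for p in probs))    # 1.75 bits
```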

 

彷彿早將『事件機率』看成了基本!!莫非越是『稀有』的事件 p_1 \times p_2 \times \cdots,『資訊』之內含 I(p_1) + I(p_2) + \cdots = \log(\frac{1}{p_1}) + \log(\frac{1}{p_2}) + \cdots 越高的哩??讀者不要以為奇怪,此事無獨有偶,也許那代人都喜歡這麼論述也!!??就像『馮‧諾伊曼』之『理性溫度計』也是用『公設法』如是展開︰

當白努利提出了一個理論來解釋『聖彼得堡悖論』時,就開啟了『效用』 Utility 的大門︰

【邊際效用遞減原理】:一個人對於『財富』的擁有多多益善,也就是說『效用函數』U(w) 的一階導數大於零 \frac{dU(w)}{dw} > 0;但隨著『財富』的增加,『滿足程度』的積累速度卻是不斷下降,正因為『效用函數』之二階導數小於零 \frac{d^2U(w)}{dw^2} < 0。

【最大效用原理】:當人處於『風險』和『不確定』的條件下,一個人『理性決策』的『準則』是為著獲得最大化『期望效用』值而不是最大之『期望金額』值。

『Utility』依據牛津大字典的『定義』是︰

The state of being useful, profitable, or beneficial:
(In game theory or economics) a measure of that which is sought to be maximized in any situation involving a choice.

如此『效用』一詞,不論代表的是哪種『喜好度』 ── 有用 useful 、有利 profitable 、滿足 Satisfaction 、愉快 Pleasure 、幸福 Happiness ──,都會涉及主觀的感覺,那麼真可以定出『尺度』的嗎?『效用函數』真的『存在』嗎??

170px-Pakkanen
溫度計
量冷熱

魯班尺
魯班尺
度吉凶

一九四七年,匈牙利之美籍猶太人數學家、現代電腦創始人之一約翰‧馮‧諾伊曼 John von Neumann 和德國-美國經濟學家奧斯卡‧摩根斯特恩 Oskar Morgenstern 提出只要『個體』的『喜好性』之『度量』滿足『四條公設』,那麼『個體』之『效用函數』就『存在』,而且除了『零點』的『規定』,以及『等距長度』之『定義』之外,這個『效用函數』還可以說是『唯一』的。就像是『個體』隨身攜帶的『理性』之『溫度計』一樣,能在任何『選擇』下,告知最大『滿意度』與『期望值』。現今這稱之為『期望效用函數理論』 Expected Utility Theory。

由於每個人的『冷熱感受』不同,所以『溫度計』上的『刻度』並不是代表數學上的一般『數字』,通常這一種比較『尺度』只有『差距值』有相對『強弱』意義,『數值比值』並不代表什麼意義,就像說,攝氏二十度不是攝氏十度的兩倍熱。這一類『尺度』在度量中叫做『等距量表』 Interval scale 。

溫度計』量測『溫度』的『高低』,『理性』之『溫度計』度量『選擇』的『優劣』。通常在『實驗經濟學』裡最廣泛採取的是『彩票選擇實驗』 lottery- choice experiments,也就是講,請你在『眾多彩票』中選擇一個你『喜好』 的『彩票』。

這樣就可以將一個有多種『機率』p_i,能產生互斥『結果』A_i 的『彩票』L 表示成︰

L = \sum \limits_{i=1}^{N} p_i A_i ,  \  \sum \limits_{i=1}^{N} p_i  =1,  \ i=1 \cdots N

如此『期望效用函數理論』之『四條公設』可以表示為︰

完整性公設】Completeness

L\prec M,M\prec L,或 L \sim M。

任意的兩張『彩票』都可以比較『喜好度』,它的結果只能是上述三種關係之一,『偏好 M』L\prec M,『偏好 L』M\prec L,『無差異』L \sim M。

遞移性公設】 Transitivity

如果 L \preceq M,而且 M \preceq N,那麼 L \preceq N

連續性公設】 Continuity

如果 L \preceq M\preceq N , 那麼存在一個『機率p\in[0,1] ,使得 pL + (1-p)N = M

獨立性公設】 Independence

如果 L\prec M, 那麼對任意的『彩票N 與『機率p\in(0,1],滿足 pL+(1-p)N \prec pM+(1-p)N

對於任何一個滿足上述公設的『理性經紀人』 rational agent ,必然可以『建構』一個『效用函數』u,使得 A_i \rightarrow u(A_i),而且對任意兩張『彩票』,L\prec M \Longleftrightarrow E(u(L)) < E(u(M))。此處 E(u(L)) 代表對 L『彩票』的『效用期望值』,簡記作 Eu(L),符合

Eu(p_1 A_1 + \ldots + p_n A_n) = p_1 u(A_1) + \cdots + p_n u(A_n)

它在『微觀經濟學』、『博弈論』與『決策論』中,今天稱之為『預期效用假說』 Expected utility hypothesis,指在有『風險』的情況下,任何『個體』所應該作出的『理性選擇』就是追求『效用期望值』的『最大化』。假使人生中的『抉擇』真實能夠如是的『簡化』,也許想得到『快樂』與『幸福』的辦法,就清楚明白的多了。然而有人認為這個『假說』不合邏輯。一九五二年,法國總體經濟學家莫里斯‧菲力‧夏爾‧阿萊斯 Maurice Félix Charles Allais ── 一九八八年,諾貝爾經濟學獎的得主 ── 作了一個著名的實驗,看看實際上人到底是怎麼『做選擇』的,這個『阿萊斯』發明的『彩票選擇實驗』就是大名鼎鼎的『阿萊斯悖論』 Allais paradox 。

─── 摘自《物理哲學·下中…》

 

這麼說來,某人擲 n 個硬幣,這一事件 0,1,1,0,0,0,1,1 \cdots ,此處 0 = 頭、1 =尾,將產生 I(2^{-n}) = \log_2(\frac{1}{2^{-n}} ) = n 比特資訊,會等於 n 個人各擲一硬幣之結果 n \times I(2^{-1}) = n \times \log_2(\frac{1}{2^{-1}} ) = n  ─── n 個一比特資訊 ── 實在是非常美妙的吧。

Entropy_flip_2_coins

2 shannons of entropy: Information entropy is the log-base-2 of the number of possible outcomes; with two coins there are four outcomes, and the entropy is two bits.
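此種『可加性』不妨用程式核驗:n 枚公平硬幣之聯合分布有 2^n 個等機率結果,其熵恰為 n 比特(以下示意沿用前述之熵函式寫法,純屬舉例):

```python
from math import log

def shannon_entropy(probs, base=2):
    return -sum(p * log(p, base) for p in probs if p > 0)

for n in (1, 2, 3, 8):
    joint = [2 ** -n] * (2 ** n)        # 2^n 個等機率之擲幣結果
    print(n, shannon_entropy(joint))    # 1.0, 2.0, 3.0, 8.0
```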

 

假使有人反對夏農之結論,就得反對他的『假設』了。

如果反面立論,假使依據夏農的觀點,一個幾乎篤定的事件 p \approx 1 ,它幾乎沒有資訊內容 I(p) = \log_2(\frac{1}{p}) \approx 0 的哩!如是對偶之 1-p ── 資訊『不確定性』 ── 也就自然成為一根『尺』矣!!

Binary entropy function

In information theory, the binary entropy function, denoted \operatorname H(p) or \operatorname H_\text{b}(p), is defined as the entropy of a Bernoulli process with probability of success p. Mathematically, the Bernoulli trial is modelled as a random variable X that can take on only two values: 0 and 1. The event X = 1 is considered a success and the event X = 0 is considered a failure. (These two events are mutually exclusive and exhaustive.)

If \operatorname{Pr}(X=1) = p, then \operatorname{Pr}(X=0) = 1-p and the entropy of X (in shannons) is given by

\operatorname H(X) = \operatorname H_\text{b}(p) = -p \log_2 p - (1 - p) \log_2 (1 - p),

where 0 \log_2 0 is taken to be 0. The logarithms in this formula are usually taken (as shown in the graph) to the base 2. See binary logarithm.

When p=\tfrac 1 2, the binary entropy function attains its maximum value. This is the case of the unbiased bit, the most common unit of information entropy.

\operatorname H(p) is distinguished from the entropy function \operatorname H(X) in that the former takes a single real number as a parameter whereas the latter takes a distribution or random variables as a parameter. Sometimes the binary entropy function is also written as \operatorname H_2(p). However, it is different from and should not be confused with the Rényi entropy, which is denoted as \operatorname H_2(X).

Binary_entropy_plot.svg

Entropy of a Bernoulli trial as a function of success probability, called the binary entropy function.

Explanation

In terms of information theory, entropy is considered to be a measure of the uncertainty in a message. To put it intuitively, suppose p=0. At this probability, the event is certain never to occur, and so there is no uncertainty at all, leading to an entropy of 0. If p=1, the result is again certain, so the entropy is 0 here as well. When p=1/2, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit. Intermediate values fall between these cases; for instance, if p=1/4, there is still a measure of uncertainty on the outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy, is less than 1 full bit.

───
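二元熵函數 \operatorname H_\text{b}(p) = -p\log_2 p - (1-p)\log_2(1-p) 之形狀,亦可直接算幾點看看,驗證其於 p = 1/2 處取最大值 1(以下為補充之示意):

```python
from math import log2

def binary_entropy(p):
    """Bernoulli(p) 之熵(單位 shannon),約定 0·log(0) = 0。"""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(binary_entropy(p), 4))
# 0.0 → 0.0, 0.25 → 0.8113, 0.5 → 1.0, 0.75 → 0.8113, 1.0 → 0.0
```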

 

事實上藉著

Logarithm

Integral representation of the natural logarithm

The natural logarithm of t equals the integral of 1/x dx from 1 to t:

\ln (t) = \int_1^t \frac{1}{x} \, dx.

601px-Natural_logarithm_integral.svg

The natural logarithm of t is the shaded area underneath the graph of the function f(x) = 1/x (reciprocal of x).

───

 

的定義 \ln(t) = \int_1^t \frac{1}{x} dx ,我們可以得到一個重要的不等式︰ \ln(t) \leq t - 1

【證明】

‧ 如果 t \geq 1 ,那麼對 1 \leq x \leq t 有 \frac{1}{x} \leq 1 ,所以

\int_1^t \frac{1}{x} dx \leq \int_1^t 1 dx = t -1

‧ 如果 0 < t \leq 1 ,那麼對 t \leq x \leq 1 有 \frac{1}{x} \geq 1 ,所以

\int_t^1 \frac{1}{x} dx \geq \int_t^1 1 dx = 1-t ,因此

-\int_t^1 \frac{1}{x} dx = \int_1^t \frac{1}{x} dx \leq t - 1
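此不等式也可以抽幾個數值檢查(僅為示意,並非證明):

```python
from math import log

# 在 (0, 5] 上抽樣檢查 ln(t) <= t - 1,等號僅於 t = 1 成立
for t in (0.1, 0.5, 1.0, 2.0, 5.0):
    assert log(t) <= t - 1 + 1e-12
    print(t, round(log(t), 4), round(t - 1, 4))
```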

 

此式直通『吉布斯不等式』的大門︰

Gibbs’ inequality

In information theory, Gibbs’ inequality is a statement about the mathematical entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs’ inequality, including Fano’s inequality. It was first presented by J. Willard Gibbs in the 19th century.

Gibbs’ inequality

Suppose that

 P = \{ p_1 , \ldots , p_n \}

is a probability distribution. Then for any other probability distribution

 Q = \{ q_1 , \ldots , q_n \}

the following inequality between positive quantities (since the pi and qi are positive numbers less than one) holds[1]:68

 - \sum_{i=1}^n p_i \log_2 p_i \leq - \sum_{i=1}^n p_i \log_2 q_i

with equality if and only if

 p_i = q_i \,

for all i. Put in words, the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]:34

 D_{\mathrm{KL}}(P\|Q) \equiv \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i} \geq 0.

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an “average surprisal” measured in bits.

Proof

Since

 \log_2 a = \frac{ \ln a }{ \ln 2 }

it is sufficient to prove the statement using the natural logarithm (ln). Note that the natural logarithm satisfies

 \ln x \leq x-1

for all x > 0 with equality if and only if x=1.

Let I denote the set of all i for which pi is non-zero. Then

- \sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq - \sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right)
= - \sum_{i \in I} q_i + \sum_{i \in I} p_i
 = - \sum_{i \in I} q_i + 1 \geq 0.

So

 - \sum_{i \in I} p_i \ln q_i \geq - \sum_{i \in I} p_i \ln p_i

and then trivially

 - \sum_{i=1}^n p_i \ln q_i \geq - \sum_{i=1}^n p_i \ln p_i

since the right hand side does not grow, but the left hand side may grow or may stay the same.

For equality to hold, we require:

  1.  \frac{q_i}{p_i} = 1 for all i \in I so that the approximation \ln \frac{q_i}{p_i} = \frac{q_i}{p_i} -1 is exact.
  2.  \sum_{i \in I} q_i = 1 so that equality continues to hold between the third and fourth lines of the proof.

This can happen if and only if

p_i = q_i

for i = 1, …, n.

───
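吉布斯不等式換句話說即 D_{\mathrm{KL}}(P\|Q) \geq 0,亦即交叉熵不小於熵。以隨機生成之分布數值檢驗如下(補充示意,分布之產生方式純屬假設):

```python
import random
from math import log2

def cross_entropy(p, q):
    """H(p, q) = -Σ p_i · log2(q_i)(假設 q_i > 0)。"""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(5):
    p, q = random_dist(4), random_dist(4)
    h_p = cross_entropy(p, p)          # 熵 H(P)
    h_pq = cross_entropy(p, q)         # 交叉熵 H(P, Q)
    assert h_p <= h_pq + 1e-12         # 吉布斯不等式
    print(round(h_p, 4), "<=", round(h_pq, 4))
```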

 

於是乎可得

- \sum \limits_{i=1}^{n} p_i \ln(p_i) \leq - \sum \limits_{i=1}^{n} p_i \ln(\frac{1}{n}) = \ln(n) \sum \limits_{i=1}^{n} p_i = \ln(n)

誠令人驚訝也???
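也就是說,在 n 個結果之中,均勻分布 q_i = \frac{1}{n} 使熵達到上界 \ln(n)。隨手檢驗(示意而已):

```python
import random
from math import log

def entropy_nats(p):
    return -sum(x * log(x) for x in p if x > 0)

random.seed(1)
n = 6
for _ in range(5):
    w = [random.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    assert entropy_nats(p) <= log(n) + 1e-12   # H(p) <= ln(n)
print(round(entropy_nats([1.0 / n] * n), 6), "=", round(log(n), 6))
```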

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】五

若說想了解『熵』 Entropy 之內涵?

化學及熱力學中所指的熵(英語:Entropy)[3],是一種測量在動力學方面不能做功的能量總數,也就是當總體的熵增加,其做功能力也下降,熵的量度正是能量退化的指標。熵亦被用於計算一個系統中的失序現象,也就是計算該系統混亂的程度。熵是一個描述系統狀態的函數,但是經常用熵的參考值和變化量進行分析比較,它在控制論、機率論、數論、天體物理、生命科學等領域都有重要應用,在不同的學科中也有引申出的更為具體的定義,是各領域十分重要的參量。

190px-Ice_water

熔冰——増熵的古典例子[1] 1862年被魯道夫·克勞修斯描寫為冰塊中分子分散性的増加[2]

 

得先知道『對數』 Log 的性質,大概奇也怪哉!那位『納皮爾』之遠見源遠流長,或許絕非純屬意外!!??

一六一四年 John Napier 約翰‧納皮爾在一本名為《 Mirifici Logarithmorum Canonis Descriptio 》── 奇妙的對數規律的描述 ── 的書中,用了三十七頁解釋『對數』log,以及給了長達九十頁的對數表。這有什麼重要的嗎?想一想即使在今天用『鉛筆』和『紙』做大位數的加減乘除,尚且困難也很容易算錯,就可以知道對數的發明,對計算一事貢獻之大的了。如果用一對一對應的觀點來看,對數把『乘除』運算『變換』為『加減』運算

\log {a * b} = \log{a} + \log{b}

\log {a / b} = \log{a} - \log{b}

,更不要說還可以算『平方』、『立方』種種和開『平方根』、『立方根』等等的計算了。

\log {a^n} = n * \log{a}

傳聞納皮爾還發明了『骨頭計算器』,他的書對於之後的天文學、力學、物理學、占星學的發展都有很大的影響。他的運算變換 Transform 的想法,開啟了『換個空間解決數學問題』的大門,比方『常微分方程式的 Laplace Transform』與『頻譜分析的傅立葉變換』等等。

這個對數畫起來是這個樣子︰

(圖︰對數函數之曲線)

不只如此這個對數關係竟然還跟人類之『五官』── 眼耳鼻舌身 ── 受到『刺激』── 色聲香味觸 ── 的『感覺』強弱大小有關 。一七九五年出生的 Ernst Heinrich Weber 韋伯,一位德國物理學家,是一位心理物理學的先驅,他提出感覺之『方可分辨』JND just-noticeable difference 的特性。比方說你提了五公斤的水,再加上半公斤,可能感覺差不了多少,要是你沒提水,說不定會覺的突然拿著半公斤的水很重。也就是說在『既定的刺激』下, 感覺的方可分辨性大小並不相同。韋伯實驗後歸結成一個關係式︰

ΔR/R = K

R:  既有刺激之物理量數值
ΔR:  方可分辨 JND 所需增加的刺激之物理量數值
K: 特定感官之常數,不同的感官不同

。之後 Gustav Theodor Fechner 費希納,一位韋伯派的學者,提出『知覺』perception 之『連續性』假設,將韋伯關係式改寫為︰

dP = k  \frac {dS}{S}

,求解微分方程式得到︰

P = k \ln S + C

假如刺激之物理量數值小於 S_0 時,人感覺不到 P = 0,就可將上式寫成︰

P = k \ln \frac {S}{S_0}

這就是知名的韋伯-費希納定律,它講著:在絕對閾限 S_0 之上,主觀知覺之強度的變化與刺激之物理量大小的改變呈現自然對數的關係,也可以說,如果刺激大小按著幾何級數倍增,所引起的感覺強度卻只依照算術級數累加。

─── 摘自《千江有水千江月》
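韋伯-費希納定律 P = k \ln \frac{S}{S_0} 之『幾何級數倍增、算術級數累加』,可用數字直觀感受(以下 k 與 S_0 之取值純屬假設):

```python
from math import log

k, S0 = 1.0, 1.0          # 假設之感官常數與絕對閾限
for S in (1, 2, 4, 8, 16, 32):
    P = k * log(S / S0)    # 韋伯-費希納:主觀知覺強度
    print(S, round(P, 4))
# 刺激每倍增一次,P 僅固定增加 k·ln(2) ≈ 0.6931
```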

 

這個『對數』的『大域』以及『微觀』性質可見諸於下︰

如果從『恆等式』identity 的『觀點』來看,『 泛函數方程式』可以看成是『泛函數恆等式』 functional identities,就像{[\sin{x}]}^2 + {[\cos{x}]}^2 = 1 這個 『三角恆等式』 一樣,假使我們藉由上式將 \sin{(x + y)} = \sin{(x)} \cos{(y)} + \cos{(x)} \sin{(y)} 恆等式改寫成 \sin{(x + y)} = \sin{(x)} \sqrt{1 - {[\sin{y}]}^2} + \sqrt{1 - {[\sin{x}]}^2} \sin{(y)},儼然是一個『 泛函數方程式』的了!因此我們也可以用『相同』的『觀點』將『微分方程式』看成是一種『泛函數恆等式』,進一步『明白』即使『不求解』那個方程式,我們依然能夠藉之得到有關『解函數』的許多重要有用的『資訊』的啊!!

之前我們曾用『均值定理

一個實數函數 f 在閉區間 [a, b] 裡『連續』且於開區間 (a, b) 中『可微分』,那麼一定存在一點 c, \ a < c < b 使得此點的『切線斜率』等於兩端點間的『割線斜率』,即 f^{\prime}(c) = \frac{f(b) - f(a)}{b - a}。

論證了『劉維爾定理』。這個『均值定理』的重要性在於,它將一個『連續』而且『可微分』的『函數』的『區間端點割線』與『區間內切線』聯繫了起來,使我們可以『確定』一個『等式』的『存在』。就讓我們再舉一個『對數性函數f(x \cdot y) = f(x) + f(y) 的例子,看看它的『運用』 吧。首先 f(1) = f(1 \cdot 1) = f(1) + f(1) \Longrightarrow f(1) = 0,其次 f(x \cdot \frac{1}{x}) = f(1) = 0 = f(x) + f(\frac{1}{x}) \Longrightarrow f(\frac{1}{x}) = - f(x),所以 f(\frac{x}{y}) = f(x \cdot \frac{1}{y}) = f(x) + f(\frac{1}{y}) = f(x) -f(y)。因此

f(x + \delta x) - f(x) = f(\frac{x + \delta x}{x}) = f(1 + \frac{\delta x}{x})

= f^{\prime}(\eta) \left[(1 + \frac{\delta x}{x}) - 1 \right], \ \eta \in (1, 1 + \frac{\delta x}{x})

= f^{\prime}(\eta) \frac{\delta x}{x}

,為什麼呢?因為 f(x) 在『閉區間』[1, 1 + \frac{\delta x}{x}] 是『平滑的』,按照『均值定理』,存在一個 \eta \in (1, 1+ \frac{\delta x}{x}) 使得

f^{\prime}(\eta) = \frac{f( 1 + \frac{\delta x}{ x}) - f(1)}{(1 + \frac{\delta x}{x})  - 1} = \frac{f( 1 + \frac{\delta x}{ x})}{ \frac{\delta x}{x}}

\therefore f(x + \delta x) = f(x) +  f^{\prime}(\eta) \frac{\delta x}{x} = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x,於是我們可以得到

f^{\prime}(x) = \frac{f^{\prime}(\eta)}{x} - \epsilon,也就是說『函數f(x) 滿足

f^{\prime}(x) = \frac{k}{x} , \ f(1)= 0, \ k= f^{\prime}(1)

它的『解』果真就是 f(x) = k \ln{(x)} 的啊!!

─── 摘自《【Sonic π】電聲學之電路學《四》之《一》》

 

如是當知 ln(x) 之『導數』為 \frac{1}{x} 的乎?似乎宇宙中有著一『大數因緣』!!此所以 \bigcirc \cdot ln(\bigcirc) 『形式』頗為常見??如果畫一圖象,

Figure xlnx

 

怎麼瞧來像《形象的叛逆》之不是煙斗的『煙斗』??!!如何知那煙嘴不冒煙 \lim \limits_{x \to 0} x \log{x} = 0 ,但問『對數』之弟兄『指數』 exp 耶?!假設 x = e^{-t} ,故曉

\lim \limits_{x \to 0} x \log{x} =\lim \limits_{e^{-t} \to 0} e^{-t} \log \ e^{-t} = \lim \limits_{t \to \infty } \frac{- t}{e^t} = 0

矣!!!更別說它們還能將『階乘』一把抓???

斯特靈公式

斯特靈公式是一條用來取 n 階乘近似值的數學公式。一般來說,當 n 很大的時候,n 階乘的計算量十分大,所以斯特靈公式十分好用,而且,即使在 n 很小的時候,斯特靈公式的取值已經十分準確。

公式為:

n! \approx \sqrt{2\pi n}\, \left(\frac{n}{e}\right)^{n}.

這就是說,對於足夠大的整數n,這兩個數互為近似值。更加精確地:

\lim_{n \rightarrow \infty} {\frac{n!}{\sqrt{2\pi n}\, \left(\frac{n}{e}\right)^{n}}} = 1

\lim_{n \rightarrow \infty} {\frac{e^n\, n!}{n^n \sqrt{n}}} = \sqrt{2 \pi}.

……

推導

這個公式,以及誤差的估計,可以推導如下。我們不直接估計n!,而是考慮它的自然對數

\ln(n!) = \ln 1 + \ln 2 + \cdots + \ln n.

這個方程的右面是積分\int_1^n \ln(x)\,dx = n \ln n - n + 1的近似值(利用梯形法則),而它的誤差由歐拉-麥克勞林公式給出:

\ln (n!) - \frac{\ln n}{2} = \ln 1 + \ln 2 + \cdots + \ln(n-1) + \frac{\ln n}{2} = n \ln n - n + 1 + \sum_{k=2}^{m} \frac{B_k {(-1)}^k}{k(k-1)} \left( \frac{1}{n^{k-1}} - 1 \right) + R_{m,n},

其中 B_k 是伯努利數,R_{m,n} 是歐拉-麥克勞林公式中的餘項。取極限,可得:

\lim_{n \to \infty} \left( \ln n! - n \ln n + n - \frac{\ln n}{2} \right) = 1 - \sum_{k=2}^{m} \frac{B_k {(-1)}^k}{k(k-1)} + \lim_{n \to \infty} R_{m,n}.

我們把這個極限記為 y。由於歐拉-麥克勞林公式中的餘項 R_{m,n} 滿足:

R_{m,n} = \lim_{n \to \infty} R_{m,n} + O \left( \frac{1}{n^{2m-1}} \right),

其中我們用到了大O符號,與以上的方程結合,便得出對數形式的近似公式:

\ln n! = n \ln \left( \frac{n}{e} \right) + \frac{\ln n}{2} + y + \sum_{k=2}^{m} \frac{B_k {(-1)}^k}{k(k-1)n^{k-1}} + O \left( \frac{1}{n^{2m-1}} \right).

兩邊取指數,並選擇任何正整數 m,我們便得到了一個含有未知數 e^y 的公式。當 m=1 時,公式為:

n! = e^{y} \sqrt{n}~{\left( \frac{n}{e} \right)}^n \left[ 1 + O \left( \frac{1}{n} \right) \right]

將上述表達式代入沃利斯乘積公式,並令 n 趨於無窮,便可以得出 e^y(e^y = \sqrt{2 \pi})。因此,我們便得出斯特靈公式:

n! = \sqrt{2 \pi n}~{\left( \frac{n}{e} \right)}^n \left[ 1 + O \left( \frac{1}{n} \right) \right]

這個公式也可以反覆使用分部積分法來得出,首項可以通過最速下降法得到。把以下的和

\ln(n!) = \sum_{j=1}^{n} \ln j

用積分近似代替,可以得出不含\sqrt{2 \pi n}的因子的斯特靈公式(這個因子通常在實際應用中無關):

\sum_{j=1}^{n} \ln j \approx \int_1^n \ln x \, dx = n\ln n - n + 1.

───
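斯特靈近似之準確程度,可直接與 n! 比對(補充示意):

```python
from math import factorial, sqrt, pi, e

def stirling(n):
    """n! ≈ √(2πn)·(n/e)^n"""
    return sqrt(2 * pi * n) * (n / e) ** n

for n in (1, 5, 10, 20):
    exact, approx = factorial(n), stirling(n)
    print(n, exact, round(approx, 2), round(approx / exact, 6))
# 比值隨 n 增大趨近 1(n = 20 時已約 0.99584)
```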

 

故深得『統計力學』之鍾愛也︰

Maxwell–Boltzmann statistics

 In statistical mechanics, Maxwell–Boltzmann statistics describes the average distribution of non-interacting material particles over various energy states in thermal equilibrium, and is applicable when the temperature is high enough or the particle density is low enough to render quantum effects negligible.The expected number of particles with energy \varepsilon_i for Maxwell–Boltzmann statistics is \langle N_i \rangle where:

 \langle N_i \rangle = \frac {g_i} {e^{(\varepsilon_i-\mu)/kT}} = \frac{N}{Z}\,g_i e^{-\varepsilon_i/kT}

where:

  • \varepsilon_i is the i-th energy level
  • \langle N_i \rangle is the number of particles in the set of states with energy \varepsilon_i
  • g_i is the degeneracy of energy level i, that is, the number of states with energy \varepsilon_i which may nevertheless be distinguished from each other by some other means.[nb 1]
  • μ is the chemical potential
  • k is Boltzmann’s constant
  • T is absolute temperature
  • N is the total number of particles
N=\sum_i N_i\,
Z=\sum_i g_i e^{-\varepsilon_i/kT}

Equivalently, the particle number is sometimes expressed as

 \langle N_i \rangle = \frac {1} {e^{(\varepsilon_i-\mu)/kT}} = \frac{N}{Z}\,e^{-\varepsilon_i/kT}

where the index i  now specifies a particular state rather than the set of all states with energy \varepsilon_i, and Z=\sum_i e^{-\varepsilon_i/kT}

……

Derivation from microcanonical ensemble

Suppose we have a container with a huge number of very small particles all with identical physical characteristics (such as mass, charge, etc.). Let’s refer to this as the system. Assume that though the particles have identical properties, they are distinguishable. For example, we might identify each particle by continually observing their trajectories, or by placing a marking on each one, e.g., drawing a different number on each one as is done with lottery balls.

The particles are moving inside that container in all directions with great speed. Because the particles are speeding around, they possess some energy. The Maxwell–Boltzmann distribution is a mathematical function that speaks about how many particles in the container have a certain energy.

In general, there may be many particles with the same amount of energy \varepsilon. Let the number of particles with the same energy \varepsilon_1 be N_1, the number of particles possessing another energy \varepsilon_2 be N_2, and so forth for all the possible energies {\varepsilon_i | i=1,2,3,…}. To describe this situation, we say that N_i is the occupation number of the energy level i. If we know all the occupation numbers {N_i | i=1,2,3,…}, then we know the total energy of the system. However, because we can distinguish between which particles are occupying each energy level, the set of occupation numbers {N_i | i=1,2,3,…} does not completely describe the state of the system. To completely describe the state of the system, or the microstate, we must specify exactly which particles are in each energy level. Thus when we count the number of possible states of the system, we must count each and every microstate, and not just the possible sets of occupation numbers.

To begin with, let’s ignore the degeneracy problem: assume that there is only one way to put N_i particles into the energy level i . What follows next is a bit of combinatorial thinking which has little to do in accurately describing the reservoir of particles.

The number of different ways of performing an ordered selection of one single object from N objects is obviously N. The number of different ways of selecting two objects from N objects, in a particular order, is thus N(N − 1) and that of selecting n objects in a particular order is seen to be N!/(N − n)!. It is divided by the number of permutations, n!, if order does not matter. The binomial coefficient, N!/(n!(N − n)!), is, thus, the number of ways to pick n objects from N. If we now have a set of boxes labelled a, b, c, d, e, …, k, then the number of ways of selecting Na objects from a total of N objects and placing them in box a, then selecting Nb objects from the remaining N − Na objects and placing them in box b, then selecting Nc objects from the remaining N − Na − Nb objects and placing them in box c, and continuing until no object is left outside is

\frac{N!}{N_a!(N-N_a)!} \times \frac{(N-N_a)!}{N_b!(N-N_a-N_b)!} \times \ldots \times \frac{(N-N_a-\ldots-N_l)!}{N_k!(N-N_a-\ldots-N_l-N_k)!}

and because not even a single object is to be left outside the boxes, implies that the sum made of the terms Na, Nb, Nc, Nd, Ne, …, Nk must equal N, thus the term (N – Na – Nb – Nc – … – Nl – Nk)! in the relation above evaluates to 0!. (0!=1) which makes possible to write down that relation as

W = \frac{N!}{N_a!\,N_b!\,N_c!\,\cdots\,N_k!}

Now going back to the degeneracy problem which characterizes the reservoir of particles. If the i-th box has a “degeneracy” of g_i, that is, it has g_i “sub-boxes”, such that any way of filling the i-th box where the number in the sub-boxes is changed is a distinct way of filling the box, then the number of ways of filling the i-th box must be increased by the number of ways of distributing the N_i objects in the g_i “sub-boxes”. The number of ways of placing N_i distinguishable objects in g_i “sub-boxes” is g_i^{N_i} (the first object can go into any of the g_i boxes, the second object can also go into any of the g_i boxes, and so on). Thus the number of ways W that a total of N particles can be classified into energy levels according to their energies, while each level i having g_i distinct states such that the i-th level accommodates N_i particles is:

W=N!\prod \frac{g_i^{N_i}}{N_i!}

This is the form for W first derived by Boltzmann. Boltzmann’s fundamental equation S=k\,\ln W relates the thermodynamic entropy S to the number of microstates W, where k is the Boltzmann constant. It was pointed out by Gibbs however, that the above expression for W does not yield an extensive entropy, and is therefore faulty. This problem is known as the Gibbs paradox. The problem is that the particles considered by the above equation are not indistinguishable. In other words, for two particles (A and B) in two energy sublevels the population represented by [A,B] is considered distinct from the population [B,A] while for indistinguishable particles, they are not. If we carry out the argument for indistinguishable particles, we are led to the Bose–Einstein expression for W:

W=\prod_i \frac{(N_i+g_i-1)!}{N_i!(g_i-1)!}

The Maxwell–Boltzmann distribution follows from this Bose–Einstein distribution for temperatures well above absolute zero, implying that g_i\gg 1. The Maxwell–Boltzmann distribution also requires low density, implying that g_i\gg N_i. Under these conditions, we may use Stirling’s approximation for the factorial:

 N! \approx N^N e^{-N},

to write:

W\approx\prod_i \frac{(N_i+g_i)^{N_i+g_i}}{N_i^{N_i}g_i^{g_i}}\approx\prod_i \frac{g_i^{N_i}(1+N_i/g_i)^{g_i}}{N_i^{N_i}}

Using the fact that (1+N_i/g_i)^{g_i}\approx e^{N_i} for g_i\gg N_i we can again use Stirling's approximation to write:

W\approx\prod_i \frac{g_i^{N_i}}{N_i!}

This is essentially a division by N! of Boltzmann’s original expression for W, and this correction is referred to as correct Boltzmann counting.

We wish to find the N_i for which the function W is maximized, while considering the constraint that there is a fixed number of particles \left(N=\textstyle\sum N_i\right) and a fixed energy \left(E=\textstyle\sum N_i \varepsilon_i\right) in the container. The maxima of W and \ln(W) are achieved by the same values of N_i and, since it is easier to accomplish mathematically, we will maximize the latter function instead. We constrain our solution using Lagrange multipliers forming the function:

 f(N_1,N_2,\ldots,N_n)=\ln(W)+\alpha(N-\sum N_i)+\beta(E-\sum N_i \varepsilon_i)
 \ln W=\ln\left[\prod\limits_{i=1}^{n}\frac{g_i^{N_i}}{N_i!}\right] \approx \sum\limits_{i=1}^n\left(N_i\ln g_i-N_i\ln N_i + N_i\right)

Finally

 f(N_1,N_2,\ldots,N_n)=\alpha N +\beta E + \sum\limits_{i=1}^n\left(N_i\ln g_i-N_i\ln N_i + N_i-(\alpha+\beta\varepsilon_i) N_i\right)

In order to maximize the expression above we apply Fermat’s theorem (stationary points), according to which local extrema, if exist, must be at critical points (partial derivatives vanish):

 \frac{\partial f}{\partial N_i}=\ln g_i-\ln N_i -(\alpha+\beta\varepsilon_i) = 0

By solving the equations above (i=1\ldots n) we arrive to an expression for N_i:

 N_i = \frac{g_i}{e^{\alpha+\beta \varepsilon_i}}

Substituting this expression for N_i into the equation for \ln W and assuming that N\gg 1 yields:

\ln W = (\alpha+1) N+\beta E\,

or, rearranging:

E=\frac{\ln W}{\beta}-\frac{N}{\beta}-\frac{\alpha N}{\beta}

Boltzmann realized that this is just an expression of the Euler-integrated fundamental equation of thermodynamics. Identifying E as the internal energy, the Euler-integrated fundamental equation states that :

E=TS-PV+\mu N

where T is the temperature, P is pressure, V is volume, and μ is the chemical potential. Boltzmann’s famous equation S=k\,\ln W is the realization that the entropy is proportional to \ln W with the constant of proportionality being Boltzmann’s constant. Using the ideal gas equation of state (PV=NkT), It follows immediately that \beta=1/kT and \alpha=-\mu/kT so that the populations may now be written:

 N_i = \frac{g_i}{e^{(\varepsilon_i-\mu)/kT}}

Note that the above formula is sometimes written:

 N_i = \frac{g_i}{e^{\varepsilon_i/kT}/z}

where z=\exp(\mu/kT) is the absolute activity.

Alternatively, we may use the fact that

\sum_i N_i=N\,

to obtain the population numbers as

 N_i = N\frac{g_i e^{-\varepsilon_i/kT}}{Z}

where Z is the partition function defined by:

 Z = \sum_i g_i e^{-\varepsilon_i/kT}

In an approximation where εi is considered to be a continuous variable, the Thomas-Fermi approximation yields a continuous degeneracy g proportional to \sqrt{\varepsilon} so that:

 \frac{\sqrt{\varepsilon}\,e^{-\varepsilon/k T}}{\int_0^\infty\sqrt{\varepsilon}\,e^{-\varepsilon/k T}\,d\varepsilon}

which is just the Maxwell-Boltzmann distribution for the energy.

───
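上述 \langle N_i \rangle = \frac{N}{Z}\,g_i e^{-\varepsilon_i/kT} 之佔據數,亦可寫成幾行 Python 感受溫度之影響(能階與簡併度之數值純屬假設):

```python
from math import exp

def mb_populations(energies, degeneracies, N, kT):
    """Maxwell–Boltzmann:N_i = N·g_i·exp(-ε_i/kT)/Z,Z = Σ g_i·exp(-ε_i/kT)。"""
    weights = [g * exp(-e / kT) for e, g in zip(energies, degeneracies)]
    Z = sum(weights)
    return [N * w / Z for w in weights]

energies = [0.0, 1.0, 2.0, 3.0]      # 假設之能階(任意單位)
degeneracies = [1, 3, 5, 7]          # 假設之簡併度 g_i
for kT in (0.5, 1.0, 5.0):
    pops = mb_populations(energies, degeneracies, N=1000, kT=kT)
    print(kT, [round(x, 1) for x in pops])
# kT 越高,粒子越往高能階分散
```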

 

『熵』之名義 S = k \ln W 亦得而出焉。終因『貼標籤』問題

海盜船

忒修斯之船

希臘古羅馬時代的普魯塔克 Plutarch 引用古希臘傳說寫道︰

忒修斯與雅典的年輕人們自克里特島歸來時,所搭之三十槳的船為雅典人留下來當做紀念碑。隨著時間流逝;木材逐漸腐朽,那時雅典人便會更換新的木頭來替代。終於此船的每根木頭都已被替換過了;因此古希臘的哲學家們就開始問著:『這艘船還是原本的那艘忒修斯之船的嗎?假使是,但它已經沒有原本的任何一根木頭了;如果不是,那它又是從什麼時候不是的呢?』

這個『同一性』identity 問題,在邏輯學上叫做『同一律』,與真假不相容的『矛盾律』齊名︰

\forall x, \ x = x

─── 摘自《Thue 之改寫系統《三》》

 

導致了『吉布斯悖論』

Gibbs paradox

In statistical mechanics, a semi-classical derivation of the entropy that does not take into account the indistinguishability of particles, yields an expression for the entropy which is not extensive (is not proportional to the amount of substance in question). This leads to a paradox known as the Gibbs paradox, after Josiah Willard Gibbs. The paradox allows for the entropy of closed systems to decrease, violating the second law of thermodynamics. A related paradox is the “mixing paradox”. If one takes the perspective that the definition of entropy must be changed so as to ignore particle permutation, the paradox is averted.

───

 

,『歸因』之事能不慎乎!!!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】四

『圓一徑三』說何事?天 ○ 地 □ 道分明。這『一』就是圓之徑,那『三』就是

圓周率

圓周率,定義為圓的周長與直徑的比值。一般以π來表示,是一個在數學及物理學中普遍存在的數學常數,是精確計算圓周長、圓面積、球體積等幾何量的關鍵值。\pi\,也等於圓的面積與半徑平方的比值。

分析學裡,\pi \,可以嚴格定義為滿足\sin(x)=0\,的最小正實數x\,,這裡的\sin\,正弦函數(採用分析學的定義)。

 

『天文曆法』務準確,『日月五星』時會合,『割圓術』興實必然 ,『歐拉』神氣大哉論,

巴塞爾問題』是一個著名的『數論問題』,最早由『皮耶特羅‧門戈利』在一六四四年所提出。由於這個問題難倒了以前許多的數學家,因此一七三五年,當『歐拉』一解出這個問題後,他馬上就出名了,當時『歐拉』二十八歲。他把這個問題作了一番推廣,他的想法後來被『黎曼』在一八五九年的論文《論小於給定大數的質數個 數》 On the Number of Primes Less Than a Given Magnitude中所採用,論文中定義了『黎曼ζ函數』,並證明了它的一些基本的性質。那麼為什麼今天稱之為『巴塞爾問題』的呢?因為『此處』這個『巴塞爾』,它正是『歐拉』和『伯努利』之家族的『家鄉』。那麼就這麽樣的一個『級數的和\sum \limits_{n=1}^\infty \frac{1}{n^2} = \lim \limits_{n \to +\infty}\left(\frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{n^2}\right) 能有什麼『重要性』的嗎?即使僅依據『發散級數』 divergent series 的『可加性』 summable  之『歷史』而言,或又得再過了百年的時間之後,也許早已經是『柯西』之『極限觀』天下後『再議論』的了!!因是我們總該看看『歷史』上『歐拉』自己的『論證』的吧!!

220px-PI.svg
巴塞爾問題
\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}

220px-Euler-10_Swiss_Franc_banknote_(front)

220px-Euler_GDR_stamp

Euler-USSR-1957-stamp

169px-Euler_Diagram.svg
邏輯之歐拉圖

假使說『三角函數』  \sin{x} 可以表示為 \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots,那麼『除以x 後,將會得到 \frac{\sin(x)}{x} = 1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \frac{x^6}{7!} + \cdots,然而 \sin{x} 的『』是 x = n\cdot\pi,由於『除以x 之緣故,因此 n \neq 0,所以 n = \pm1, \pm2, \pm3, \dots,那麼 \frac{\sin(x)}{x} 應該會『等於\left(1 - \frac{x}{\pi}\right)\left(1 + \frac{x}{\pi}\right)\left(1 - \frac{x}{2\pi}\right)\left(1 + \frac{x}{2\pi}\right)\left(1 - \frac{x}{3\pi}\right)\left(1 + \frac{x}{3\pi}\right) \cdots,於是也就『等於\left(1 - \frac{x^2}{\pi^2}\right)\left(1 - \frac{x^2}{4\pi^2}\right)\left(1 - \frac{x^2}{9\pi^2}\right) \cdots,若是按造『牛頓恆等式』,考慮 x^2 項的『係數』, 就會有 - \left(\frac{1}{\pi^2} + \frac{1}{4\pi^2} + \frac{1}{9\pi^2} + \cdots \right) = -\frac{1}{\pi^2}\sum_{n=1}^{\infty}\frac{1}{n^2},然而 \frac{\sin(x)}{x}  之『 x^2』的『係數』 是『- \frac{1}{3!} = -\frac{1}{6}』,所以 -\frac{1}{6} = -\frac{1}{\pi^2}\sum \limits_{n=1}^{\infty}\frac{1}{n^2},於是 \sum \limits_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}。那麼『歐拉』是『』的嗎?還是他還是『』的呢??

─── 摘自《【Sonic π】電聲學之電路學《四》之《 V!》‧下中》

 

\pi 之名號響天際。

採用π為符號

現時所知,最早使用希臘字母π代表圓周率,是威爾斯數學家威廉·瓊斯的1706年著作《Synopsis Palmariorum Matheseos; or, a New Introduction to the Mathematics》。[24]書中首次出現希臘字母π,是討論半徑為1的圓時,在短語「1/2 Periphery (π)」之中。[25]他選用π,或許由於π是periphery(周邊)的希臘語對應單詞περιφέρεια的首字母。

然而,其他數學家未立刻跟從,有時數學家會用c, p等字母代表圓周率。[26]將π的這個用法推廣出去的,是數學家歐拉。他在1736年的著作《Mechanica》開始使用π。因為歐拉與歐洲其他數學家通信頻繁,這樣就把π的用法迅速傳播。[26] 1748年,歐拉在廣受閱讀的名著《無窮小分析引論》(Introductio in analysin infinitorum)使用π。他寫道:「為簡便故,我們將這數記為π,因此π=半徑為1的圓的半周長,換言之π是180度弧的長度。」於是π就在西方世界得到普遍接受。[26][27]

 

雖然 e^{i \pi} + 1 =0 猶在目,世間轉眼批評起。

批評

近年來,有部分學者認為約等於3.14的π「不合自然」,應該用雙倍於π、約等於6.28的一個常數代替。支持這一說法的學者認為在很多數學公式2π很常見,很少單獨使用一個π。美國哈佛大學物理學教授的邁克爾·哈特爾稱「圓形與直徑無關,而與半徑相關,圓形由離中心一定距離即半徑的一系列點構成」。並建議使用希臘字母τ來代替π[28][29][30]

美國數學家鮑勃·帕萊(Bob Palais)於2001年在《數學情報》(The Mathematical Intelligencer)上發表了一篇題為《π 是錯誤的!》(π Is Wrong!)的論文。在論文的第一段,鮑勃·帕萊說道:

幾個世紀以來,π 受到了無限的推崇和讚賞。數學家們歌頌 π 的偉大與神秘,把它當作數學界的象徵;計算器和程式設計語言裡也少不了 π 的身影;甚至有 一部電影 就直接以它命名⋯⋯但是,π 其實只是一個冒牌貨,真正值得大家敬畏和讚賞的,其實應該是一個不幸被我們稱作 2π 的數。

美國數學家麥克·哈特爾(Michael Hartl) 建立了網站 tauday.com ,呼籲人們用希臘字母 τ(發音:tau)來表示「正確的」圓周率 C/r。並建議大家以後在寫論文時,用一句「為方便起見,定義 τ = 2π 」開頭。

著名的 Geek 漫畫網站 spikedmath.com 建立了 thepimanifesto.com ,裡邊有一篇洋洋灑灑數千字的 π 宣言,反駁支持τ的言論,宣稱圓周率定義為周長與直徑之比有優越性,並認為在衡量圓柱形物體的截面大小時,直徑比半徑更方便測量。

 

千年長河萬年水,釀成『山巔一寺一壺酒』!

文化

背誦

世界記錄是100,000位,日本人原口證於2006年10月3日背誦圓周率π至小數點後100,000位。[31]

普通話用諧音記憶的有「山巔一寺一壺酒,爾樂苦煞吾,把酒吃,酒殺爾,殺不死,樂而樂」,就是3.1415926535897932384626。 另一諧音為:「山巔一石一壺酒,二妞舞扇舞,把酒沏酒搧又搧,飽死囉」,就是3.14159265358979323846。

英文, 會使用英文字母的長度作為數字,例如「How I want a drink, alcoholic of course, after the heavy lectures involving quantum mechanics. All of the geometry, Herr Planck, is fairly hard, and if the lectures were boring or tiring, then any odd thinking was on quartic equations again.」就是3.1415926535897932384626433832795。

───

 

若問 \pi 何物也??『科技術數』 STEM 之歷史旗幟乎!!如是看來『派』 Pi 當真是『養生之道』耶??!!

或許 Michael Nielsen 先生長於『養智』,善於『啟發』學者,故而『習題』比『正文』內容還多,大概希望讀者『動手用腦』的吧︰

Let’s return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we’ll begin with the case where the quadratic cost did just fine, with starting weight 0.6 and starting bias 0.9. Press “Run” to see what happens when we replace the quadratic cost by the cross-entropy:

Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let’s look at the case where our neuron got stuck before (link, for comparison), with the weight and bias both starting at 2.0:

Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It’s that steepness which the cross-entropy buys us, preventing us from getting stuck just when we’d expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

I didn’t say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used \eta = 0.15. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it’s not possible to say precisely what it means to use the “same” learning rate; it’s an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you’re still curious, despite my disavowal, here’s the lowdown: I used \eta = 0.005 in the examples just given.

You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn’t about the absolute speed of learning. It’s about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don’t depend on how the learning rate is set.

We’ve been studying the cross-entropy for a single neuron. However, it’s easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose y = y_1, y_2, \ldots are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, \ldots are the actual output values. Then we define the cross-entropy by

C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \ \ \ \ (63)

This is the same as our earlier expression, Equation (57), except now we’ve got the \sum_j summing over all the output neurons. I won’t explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you’re interested, you can work through the derivation in the problem below.

When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we’re setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input – that is, an output neuron will have saturated near 1, when it should be 0, or vice versa. If we’re using the quadratic cost that will slow down learning. It won’t stop learning completely, since the weights will continue learning from other training inputs, but it’s obviously undesirable.

Exercises

  • One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the ys and the as. It’s easy to get confused about whether the right form is -[y \ln a + (1-y) \ln (1-a)] or -[a \ln y + (1-a) \ln (1-y)]. What happens to the second of these expressions when y = 0 or 1? Does this problem afflict the first expression? Why or why not?
  • In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if \sigma(z) \approx y for all training inputs. The argument relied on y being equal to either 0 or 1. This is usually true in classification problems, but for other problems (e.g., regression problems) y can sometimes take values intermediate between 0 and 1. Show that the cross-entropy is still minimized when \sigma(z) = y for all training inputs. When this is the case the cross-entropy has the value:
    C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)]. \ \ \ \ (64)

    The quantity -[y \ln y+(1-y)\ln(1-y)] is sometimes known as the binary entropy.

Problems

  • Many-layer multi-neuron networks In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is
    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \sigma'(z^L_j). \ \ \ \ (65)

    The term \sigma'(z^L_j) causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error \delta^L for a single training example x is given by

    \delta^L = a^L-y. \ \ \ \ (66)

    Use this expression to show that the partial derivative with respect to the weights in the output layer is given by

    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j). \ \ \ \ (67)

    The \sigma'(z^L_j) term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.

  • Using the quadratic cost when we have linear neurons in the output layer Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function then the output error \delta^L for a single training example x is given by
    \delta^L = a^L-y. \ \ \ \ (68)

    Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by

    \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j) \ \ \ \ (69)

    \frac{\partial C}{\partial b^L_{j}} = \frac{1}{n} \sum_x (a^L_j-y_j). \ \ \ \ (70)

     

    This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.

───
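上列習題之要點,亦可先以數值『摸一摸』:對多神經元輸出層而言,交叉熵 (63) 對 z^L_j 之梯度正是 a^L_j - y_j,不帶 \sigma'(z) 因子。以下用有限差分核對(補充示意,非 Nielsen 原書之程式):

```python
from math import log, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def cross_entropy_cost(zs, ys):
    """單一訓練樣本之多神經元交叉熵:C = -Σ_j [y_j ln a_j + (1-y_j) ln(1-a_j)],a_j = σ(z_j)。"""
    return -sum(y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))
                for z, y in zip(zs, ys))

zs, ys, eps = [2.0, -1.0, 0.3], [0.0, 1.0, 1.0], 1e-6
for j in range(len(zs)):
    zp = zs[:]; zp[j] += eps
    zm = zs[:]; zm[j] -= eps
    numeric = (cross_entropy_cost(zp, ys) - cross_entropy_cost(zm, ys)) / (2 * eps)
    print(j, round(numeric, 6), round(sigmoid(zs[j]) - ys[j], 6))  # 兩者應相符:δ^L_j = a^L_j - y_j
```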

 

因是,此處不宜再多說什麼,願人人能『築夢踏實』。

老子四十二章中講︰道生一,一生二,二生三,三生萬物。是說天地生萬物就像四季循環自然而然,如果『』或將成為『亦大』,就得知道大自然 『之道,能循本能『得一』。他固善於『觀水』,盛讚『上善若水』,卻也深知水為山堵之『』、人為慾阻之『』難,故於第三十九章中又講︰

昔之得一者:天得一以清,地得一以寧,神得一以靈,谷得一以盈,萬物得一以生,侯王得一以為天下貞。其致之,天無以清將恐裂,地無以寧將恐發,神無以靈將恐歇,谷無以盈將恐竭,萬物無以生將恐滅,侯王無以貞高將恐蹶。故貴以賤為本,高以下為基。是以侯王自謂孤寡不穀,此非以賤為本耶?非乎?人之所惡,唯孤寡不穀,而侯王以為稱。故致譽無譽,不欲琭琭如玉,珞珞如石。

,希望人們知道所謂『道德』之名,實在說的是『得到』── 得道── 的啊!!如果乾坤都『沒路』可走,人又該往向『何方』??

─── 摘自《跟隨□?築夢!!》

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】三

論衡》‧《訂鬼

凡天地之間有鬼,非人死精神為之也,皆人思念存想之所致也。致之何由?由於疾病。人病則憂懼,憂懼見鬼出。

凡人不病則不畏懼。故得病寢衽,畏懼鬼至。畏懼則存想,存想則目虛見。何以效之?《傳》曰:「伯樂學相馬,顧玩所見,無非馬者。宋之庖丁學解牛,三年不見生牛,所見皆死牛也。」二者用精至矣,思念存想,自見異物也。人病見鬼,猶伯樂之見馬,庖丁之見牛也。伯樂、庖丁所見非馬與牛,則亦知夫病者所見非鬼也。病者困劇身體痛,則謂鬼持箠杖敺擊之,若見鬼把椎鎖繩纆立守其旁 ,病痛恐懼,妄見之也。初疾畏驚,見鬼之來;疾困恐死,見鬼之怒;身自疾痛,見鬼之擊,皆存想虛致,未必有其實也。夫精念存想,或泄於目,或泄於口,或泄於耳。泄於目,目見其形;泄於耳 ,耳聞其聲;泄於口,口言其事。晝日則鬼見,暮臥則夢聞。獨臥空室之中,若有所畏懼,則夢見夫人據案其身哭矣。覺見臥聞,俱用精神;畏懼、存想,同一實也。

一曰:人之見鬼,目光與臥亂也。人之晝也,氣倦精盡,夜則欲臥 ,臥而目光反,反而精神見人物之象矣。人病亦氣倦精盡,目雖不臥,光已亂於臥也,故亦見人物象。病者之見也,若臥若否,與夢相似。當其見也,其人能自知覺與夢,故其見物不能知其鬼與人,精盡氣倦之效也。何以驗之?以狂者見鬼也。狂癡獨語,不與善人相得者,病困精亂也。夫病且死之時,亦與狂等。臥、病及狂,三者皆精衰倦,目光反照,故皆獨見人物之象焉。

一曰:鬼者、人所見得病之氣也。氣不和者中人,中人為鬼,其氣象人形而見。故病篤者氣盛,氣盛則象人而至,至則病者見其象矣 。假令得病山林之中,其見鬼則見山林之精。人或病越地者,病見越人坐其側。由此言之,灌夫、竇嬰之徒,或時氣之形象也。凡天地之間,氣皆純於天,天文垂象於上,其氣降而生物。氣和者養生 ,不和者傷害。本有象於天,則其降下,有形於地矣。故鬼之見也 ,象氣為之也。眾星之體,為人與鳥獸,故其病人則見人與鳥獸之形。

一曰:鬼者、老物精也。夫物之老者,其精為人;亦有未老,性能變化,象人之形。人之受氣,有與物同精者,則其物與之交。及病 ,精氣衰劣也,則來犯陵之矣。何以效之?成事:俗間與物交者,見鬼之來也。夫病者所見之鬼,與彼病物何以異?人病見鬼來,象其墓中死人來迎呼之者,宅中之六畜也。及見他鬼,非是所素知者 ,他家若草野之中物為之也。

一曰:鬼者、本生於人。時不成人,變化而去。天地之性,本有此化,非道術之家所能論辯。與人相觸犯者病,病人命當死,死者不離人。何以明之?《禮》曰:「顓頊氏有三子,生而亡去為疫鬼:一居江水,是為虐鬼;一居若水,是為魍魎鬼;一居人宮室區隅漚庫,善驚人小兒。」前顓頊之世,生子必多,若顓頊之鬼神以百數也。諸鬼神有形體法,能立樹與人相見者,皆生於善人,得善人之氣,故能似類善人之形,能與善人相害。陰陽浮游之類,若雲煙之氣,不能為也。

一曰:鬼者、甲乙之神也。甲乙者、天之別氣也,其形象人。人病且死,甲乙之神至矣。假令甲乙之日病,則死見庚辛之神矣。何則 ?甲乙鬼,庚辛報甲乙,故病人且死,殺鬼之至者,庚辛之神也。何以效之?以甲乙日病者,其死生之期,常在庚辛之日。此非論者所以為實也。天道難知,鬼神闇昧,故具載列,令世察之也。

一曰:鬼者、物也,與人無異。天地之間,有鬼之物,常在四邊之外,時往來中國,與人雜則,凶惡之類也,故人病且死者乃見之。天地生物也,有人如鳥獸,及其生凶物,亦有似人象鳥獸者。故凶禍之家,或見蜚尸,或見走凶,或見人形,三者皆鬼也。或謂之鬼 ,或謂之凶,或謂之魅,或謂之魑,皆生存實有,非虛無象類之也。何以明之?成事:俗間家人且凶,見流光集其室,或見其形若鳥之狀,時流人堂室,察其不謂若鳥獸矣。夫物有形則能食,能食則便利。便利有驗,則形體有實矣。《左氏春秋》曰:「投之四裔 ,以禦魑魅。」《山海經》曰:「北方有鬼國。」說螭者謂之龍物也,而魅與龍相連,魅則龍之類矣。又言「國」、人物之黨也。《山海經》又曰:「滄海之中,有度朔之山,上有大桃木,其屈蟠三千里,其枝間東北曰鬼門,萬鬼所出入也。上有二神人,一曰神荼,一曰鬱壘,主閱領萬鬼。惡害之鬼,執以葦索,而以食虎。於是黃帝乃作禮以時驅之,立大桃人,門戶畫神荼、鬱壘與虎,懸葦索以禦。」凶魅有形,故執以食虎。案可食之物,無空虛者。其物也,性與人殊,時見時匿,與龍不常見,無以異也。

一曰:人且吉凶,妖祥先見。人之且死,見百怪,鬼在百怪之中。故妖怪之動,象人之形,或象人之聲為應,故其妖動不離人形。天地之間,妖怪非一,言有妖,聲有妖,文有妖。或妖氣象人之形,或人含氣為妖。象人之形,諸所見鬼是也;人含氣為妖,巫之類是也。是以實巫之辭,無所因據,其吉凶自從口出,若童之謠矣。童謠口自言,巫辭意自出。口自言,意自出,則其為人,與聲氣自立 ,音聲自發,同一實也。世稱紂之時,夜郊鬼哭,及倉頡作書,鬼夜哭。氣能象人聲而哭,則亦能象人形而見,則人以為鬼矣。

鬼之見也,人之妖也。天地之間,禍福之至,皆有兆象,有漸不卒然,有象不猥來。天地之道,人將亡,凶亦出;國將亡,妖亦見。猶人且吉,吉祥至;國且昌,昌瑞到矣。故夫瑞應妖祥,其實一也 。而世獨謂鬼者不在妖祥之中,謂鬼猶神而能害人,不通妖祥之道 ,不睹物氣之變也。國將亡,妖見,其亡非妖也。人將死,鬼來,其死非鬼也。亡國者、兵也,殺人者、病也。何以明之?齊襄公將為賊所殺,游于姑棼,遂田于貝丘,見大豕。從者曰:「公子彭生也。」公怒曰:「彭生敢見!」引弓射之,豕人立而啼。公懼,墜于車,傷足,喪履,而為賊殺之。夫殺襄公者,賊也。先見大豕於路,則襄公且死之妖也。人謂之彭生者,有似彭生之狀也。世人皆知殺襄公者非豕,而獨謂鬼能殺人,一惑也。

天地之氣為妖者,太陽之氣也。妖與毒同,氣中傷人者謂之毒,氣變化者謂之妖。世謂童謠,熒惑使之,彼言有所見也。熒惑火星,火有毒熒,故當熒惑守宿,國有禍敗。火氣恍惚,故妖象存亡。龍、陽物也,故時變化。鬼、陽氣也,時藏時見。陽氣赤,故世人盡見鬼,其色純朱。蜚凶、陽也,陽、火也,故蜚凶之類為火光。火熱焦物,故止集樹木,枝葉枯死。《鴻範》五行二曰火,五事二曰言。言、火同氣,故童謠、詩歌為妖言。言出文成,故世有文書之怪。世謂童子為陽,故妖言出於小童。童、巫含陽,故大雩之祭 ,舞童暴巫。雩祭之禮,倍陰合陽,故猶日食陰勝,攻社之陰也。日食陰勝,故攻陰之類。天旱陽勝,故愁陽之黨。巫為陽黨,故魯僖遭旱,議欲焚巫。巫含陽氣,以故陽地之民多為巫。巫黨於鬼,故巫者為鬼巫。鬼巫比於童謠,故巫之審者,能處吉凶。吉凶能處 ,吉凶之徒也,故申生之妖見於巫。巫含陽,能見為妖也。申生為妖,則知杜伯、莊子義、厲鬼之徒皆妖也。杜伯之厲為妖,則其弓矢、投、措皆妖毒也。妖象人之形,其毒象人之兵。鬼、毒同色,故杜伯弓矢皆朱彤也。毒象人之兵,則其中人,人輒死也。中人微者即為腓,病者不即時死。何則?腓者、毒氣所加也。

妖或施其毒,不見其體;或見其形,不施其毒;或出其聲,不成其言;或明其言,不知其音。若夫申生,見其體、成其言者也;杜伯之屬,見其體、施其毒者也;詩妖、童謠、石言之屬,明其言者也 ;濮水琴聲,紂郊鬼哭,出其聲者也。妖之見出也,或且凶而豫見 ,或凶至而因出。因出,則妖與毒俱行;豫見,妖出不能毒。申生之見,豫見之妖也;杜伯、莊子義、厲鬼至,因出之妖也。周宣王 、燕簡公、宋夜姑時當死,故妖見毒因擊。晉惠公身當獲,命未死 ,故妖直見而毒不射。然則杜伯、莊子義、厲鬼之見,周宣王、燕簡、夜姑且死之妖也。申生之出,晉惠公且見獲之妖也。伯有之夢 ,駟帶、公孫叚且卒之妖也。老父結草,魏顆且勝之祥,亦或時杜回見獲之妖也。蒼犬噬呂后,呂后且死,妖象犬形也。,武安且卒 ,妖象竇嬰、灌夫之面也。

故凡世間所謂妖祥、所謂鬼神者,皆太陽之氣為之也。太陽之氣、天氣也。天能生人之體,故能象人之容。夫人所以生者,陰、陽氣也。陰氣主為骨肉,陽氣主為精神。人之生也,陰、陽氣具,故骨肉堅,精氣盛。精氣為知,骨肉為強,故精神言談,形體固守。骨肉精神,合錯相持,故能常見而不滅亡也。太陽之氣,盛而無陰,故徒能為象,不能為形。無骨肉,有精氣,故一見恍惚,輒復滅亡也。

 

『鬼』 鬼 不知是何物也?《說文解字》講:鬼,人所歸為鬼。从人,象鬼頭。鬼陰气賊害,从厶。凡鬼之屬皆从鬼。鬼一,古文从示。那『鬼』就是『人之歸』耶!東漢王充認為不是『人死精神』為之,『訂』 訂之為『人思念存想之所致』也?!然而

淮南子》‧《說山訓》有言︰

魄問於魂曰:「道何以為體?」曰:「以無有為體。」魄曰:「無有有形乎?」魂曰:「無有。」「何得而聞也?」魂曰:「吾直有所遇之耳。視之無形,聽之無聲,謂之幽冥。幽冥者,所以喻道,而非道也。魄曰:「吾聞得之矣。乃內視而自反也。」魂曰:「凡得道者,形不可得而見,名不可得而揚。今汝已有形名矣,何道之所能乎!」魄曰:「言者,獨何為者?」「吾將反吾宗矣。」魄反顧,魂忽然不見,反而自存,亦以淪於無形矣。

人不小學,不大迷;不小慧,不大愚。人莫鑒於沫雨,而鑒於澄水者,以其休止不蕩也。詹公之釣,千歲之鯉不能避;曾子攀柩車,引楯者為之止也;老母行歌而動申喜,精之至也;瓠巴鼓瑟,而淫魚出聽;伯牙鼓琴,駟馬仰秣;介子歌龍蛇,而文君垂泣。故玉在山而草木潤,淵生珠而岸不枯。螾無筋骨之強,爪牙之利,上食晞堁,下飲黃泉,用心一也。清之為明,杯水見眸子;濁之為暗,河水不見太山。視日者眩,聽雷者聾;人無為則治,有為則傷。無為而治者,載無也;為者,不能有也;不能無為者,不能有為也。人無言而神,有言則傷。無言而神者載無,有言則傷其神。之神者,鼻之所以息,耳之所以聽,終以其無用者為用矣。

……

畏馬之辟也,不敢騎;懼車之覆也,不敢乘;是以虛禍距公利也。不孝弟者,或詈父母。生子者,所不能任其必孝也,然猶養而長之 。範氏之敗,有竊其鍾負而走者,鎗然有聲,懼人聞之,遽掩其耳 。憎人聞之,可也;自掩其耳,悖矣。升之不能大於石也,升在石之中;夜不能修其歲也,夜在歲之中;仁義之不能大於道德也,仁義在道德之包。先針而後縷,可以成帷;先縷而後針,不可以成衣 。針成幕,蔂成城。事之成敗,必由小生。言有漸也。染者先青而後黑則可,先黑而後青則不可;工人下漆而上丹則可,下丹而上漆則不可。萬事由此,所先後上下,不可不審。水濁而魚噞,形勞而神亂。故國有賢君,折沖萬里。因媒而嫁,而不因媒而成;因人而交,不因人而親。行合趨同,千里相從;行不合,趨不同,對門不通。海水雖大,不受胔芥,日月不應非其氣,君子不容非其類也。人不愛倕之手,而愛己之指,不愛江、漢之珠,而愛己之鉤。以束薪為鬼,以火煙為氣。以束薪為鬼,朅而走;以火煙為氣,殺豚烹狗。先事如此,不如其後。巧者善度,知者善豫。羿死桃部,不給射;慶忌死劍鋒,不給搏。滅非者戶告之曰:「我實不與我諛亂。 」謗乃愈起。止言以言,止事以事,譬猶揚堁而弭塵,抱薪而救火 。流言雪汙,譬猶以涅拭素也。

───

 

,或以為『鬼』乃『束薪火煙』耶!!??

『抽象事物』既已『抽象』,故『無象』可見,不過尚有『定義』在焉︰

\bigcirc \ln \square + (1- \bigcirc ) \ln (1- \square)

,因而無需玄想、不必狐疑,自可原理推論矣。所以莫問

Cross entropy

In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.

The cross entropy for the distributions p and q over a given set is defined as follows:

H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q),\!

where H(p) is the entropy of p, and D_{\mathrm{KL}}(p || q) is the Kullback–Leibler divergence of q from p (also known as the relative entropy of p with respect to q — note the reversal of emphasis).

For discrete p and q this means

H(p, q) = -\sum_x p(x)\, \log q(x). \!

The situation for continuous distributions is analogous:

-\int_X p(x)\, \log q(x)\, dx. \!

NB: The notation H(p,q) is also used for a different concept, the joint entropy of p and q.

……

Cross-entropy error function and logistic regression

Cross entropy can be used to define the loss function in machine learning and optimization. The true probability p_i is the true label, and the given distribution q_i is the predicted value of the current model.

More specifically, let us consider logistic regression, which (in its most basic form) deals with classifying a given set of data points into two possible classes generically labelled 0 and 1. The logistic regression model thus predicts an output y\in\{0,1\}, given an input vector \mathbf{x}. The probability is modeled using the logistic function g(z)=1/(1+e^{-z}). Namely, the probability of finding the output y=1 is given by

q_{y=1}\ =\ \hat{y}\ \equiv\ g(\mathbf{w}\cdot\mathbf{x})\,,

where the vector of weights \mathbf{w} is learned through some appropriate algorithm such as gradient descent. Similarly, the conjugate probability of finding the output y=0 is simply given by

q_{y=0}\ =\ 1-\hat{y}

The true (observed) probabilities can be expressed similarly as p_{y=1}=y and p_{y=0}=1-y.
Having set up our notation, p\in\{y,1-y\} and q\in\{\hat{y},1-\hat{y}\}, we can use cross entropy to get a measure for similarity between p and q:

H(p,q)\ =\ -\sum_ip_i\log q_i\ =\ -y\log\hat{y} - (1-y)\log(1-\hat{y})

The typical loss function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample. For example, suppose we have N samples with each sample labeled by n=1,\dots,N. The loss function is then given by:

J(\mathbf{w})\ =\ \frac{1}{N}\sum_{n=1}^N H(p_n,q_n)\ =\ -\frac{1}{N}\sum_{n=1}^N\ \left[y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n)\right]\,,

where \hat{y}_n\equiv g(\mathbf{w}\cdot\mathbf{x}_n), with g(z) the logistic function as before.
The logistic loss is sometimes called cross-entropy loss. It’s also known as log loss (In this case, the binary label is often denoted by {-1,+1}).[1]

───
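上述邏輯迴歸之損失函數,寫成程式不過數行(以下為極簡示意,變數與資料皆屬假設):

```python
from math import exp, log

def logistic(z):
    return 1.0 / (1.0 + exp(-z))

def log_loss(w, xs, ys, eps=1e-12):
    """平均交叉熵 J(w) = -(1/N) Σ_n [y_n·ln(ŷ_n) + (1-y_n)·ln(1-ŷ_n)],ŷ_n = g(w·x_n)。"""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = logistic(sum(wi * xi for wi, xi in zip(w, x)))
        total += y * log(y_hat + eps) + (1 - y) * log(1 - y_hat + eps)
    return -total / len(xs)

xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]   # 含偏置項之輸入(示意)
ys = [1, 0, 1]
print(round(log_loss([0.0, 1.0], xs, ys), 4))
print(round(log_loss([0.5, 2.0], xs, ys), 4))  # 權重較貼近資料時,損失較小
```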

 

是什麼?直須了解『此物定義』即可!事實上『它的意義』人們還在探索中,『熱力學』與『統計力學』也不曾見其蹤跡。如是可知 Michael Nielsen 先生行文之為難也︰

Introducing the cross-entropy cost function

How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let’s move a little away from our super-simple toy model. We’ll suppose instead that we’re trying to train a neuron with several input variables, x_1, x_2, \ldots, corresponding weights w_1, w_2, \ldots, and a bias, b:

The output from the neuron is, of course, a = \sigma(z), where z = \sum_j w_j x_j+b is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by

C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \ \ \ \ (57)

where n is the total number of items of training data, the sum is over all training inputs, x, and y is the corresponding desired output.

It’s not obvious that the expression (57) fixes the learning slowdown problem. In fact, frankly, it’s not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let’s see in what sense the cross-entropy can be interpreted as a cost function.

Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it’s non-negative, that is, C > 0. To see this, notice that: (a) all the individual terms in the sum in (57) are negative, since both logarithms are of numbers in the range 0 to 1; and (b) there is a minus sign out the front of the sum.

Second, if the neuron’s actual output is close to the desired output for all training inputs, x, then the cross-entropy will be close to zero*

*To prove this I will need to assume that the desired outputs y are all either 0 or 1. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don’t make this assumption, see the exercises at the end of this section.

. To see this, suppose for example that y = 0 and a \approx 0 for some input x. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (57) for the cost vanishes, since y = 0, while the second term is just -\ln (1-a) \approx 0. A similar analysis holds when y = 1 and a \approx 1. And so the contribution to the cost will be low provided the actual output is close to the desired output.

Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, y, for all training inputs, x. These are both properties we’d intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that’s good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let’s compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute a = \sigma(z) into (57), and apply the chain rule twice, obtaining:

\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \ \ \ \ (58)
=  -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j. \ \ \ \ (59)

Putting everything over a common denominator and simplifying this becomes:

\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \ \ \ \ (60)

Using the definition of the sigmoid function, \sigma(z) = 1/(1+e^{-z}), and a little algebra we can show that \sigma'(z) = \sigma(z)(1-\sigma(z)). I’ll ask you to verify this in an exercise below, but for now let’s accept it as given. We see that the \sigma'(z) and \sigma(z)(1-\sigma(z)) terms cancel in the equation just above, and it simplifies to become:

\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \ \ \ \ (61)

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by \sigma(z)-y, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we’d intuitively expect. In particular, it avoids the learning slowdown caused by the \sigma'(z) term in the analogous equation for the quadratic cost, Equation (55). When we use the cross-entropy, the \sigma'(z) term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it’s not really a miracle. As we’ll see later, the cross-entropy was specially chosen to have just this property.

In a similar way, we can compute the partial derivative for the bias. I won’t go through all the details again, but you can easily verify that

\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y). \ \ \ \ (62)

Again, this avoids the learning slowdown caused by the \sigma'(z) term in the analogous equation for the quadratic cost, Equation (56).

───
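方程 (61)、(62) 與二次成本對應式之差別,可用單一神經元數值比較:同樣的誤差 \sigma(z) - y,二次成本之梯度多乘一個 \sigma'(z),在飽和區便趨近於零(補充示意):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# 單一神經元、單一樣本 x = 1、目標 y = 0(對應文中之玩具例子)
x, y = 1.0, 0.0
for w, b in [(0.6, 0.9), (2.0, 2.0)]:
    z = w * x + b
    a = sigmoid(z)
    grad_quadratic = (a - y) * sigmoid_prime(z) * x   # 二次成本之 ∂C/∂w
    grad_cross_ent = (a - y) * x                      # 交叉熵之 ∂C/∂w,見 (61)
    print((w, b), round(a, 3), round(grad_quadratic, 4), round(grad_cross_ent, 4))
# 起點 (2.0, 2.0) 時 a ≈ 0.982,二次成本之梯度 ≈ 0.0173,交叉熵之梯度則 ≈ 0.982
```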

 

彷彿『cross-entropy』只是為著解決『the learning slowdown 』而生乎!!??

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】二

老鐵匠的生命鏈條

發布者︰淺陌安然

有個老鐵匠,他打的鐵比誰的都要牢靠,可是他不善言辭,賣出的鐵鏈很少。人家說他太老實,但他不管這些,仍舊把每一根鐵鏈都打得結結實實。他打的一條鐵鏈裝在一艘大海輪上作為主錨鏈,但卻從來沒有用過。一天晚上,海上風暴驟起,隨時都可能把船沖到礁石上。船上所有錨鏈都放下海裡,但很快都被掙斷,只有老鐵匠那條鐵鍊還緊緊拉著風口浪尖上的輪船。在無數個鏈環中,哪怕有一環斷裂,船上1000多名乘客和貨物都將被死神吞噬!經歷了一夜的暴風驟雨的考驗,老鐵匠的那條鐵鍊還牢牢抓著海底的岩石。當黎明來到,風平浪靜,所有的人為此熱淚盈眶,歡騰不已……

 

Comment: Success grows out of an insistence on perfection and the accumulation of small things; failure grows out of a string of tiny mistakes piling up. There is no shortcut to success, only ever finer craftsmanship; excellence comes from rigour and lives in the perfection of details.

 

Using a ‘small story’ to preach a ‘big truth’ looks easy, yet is in fact extremely difficult! Which is why, at this point, Mr. Michael Nielsen takes an unexpected turn of the pen and writes at great length about the ‘zero-one learning’ problem of one tiny ‘neuron’?? Truly an astonishing way to write!!

The cross-entropy cost function

Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn’t continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we’re decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.

Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let’s look at a toy example. The example involves a neuron with just one input:

We’ll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let’s take a look at how the neuron learns.

To make things definite, I’ll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning, I wasn’t picking them to be special in any way. The initial output from the neuron is 0.82, so quite a bit of learning will be needed before our neuron gets near the desired output, 0.0. Click on “Run” in the bottom right corner below to see how the neuron learns an output much closer to 0.0. Note that this isn’t a pre-recorded animation, your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is \eta = 0.15, which turns out to be slow enough that we can follow what’s happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, C, introduced back in Chapter 1. I’ll remind you of the exact form of the cost function shortly, so there’s no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on “Run” again.

[Figure: Sigmoid-1 — the training run starting from weight 0.6 and bias 0.9]

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. That’s not quite the desired output, 0.0, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let’s look at how the neuron learns to output 0 in this case. Click on “Run” again:

[Figure: Sigmoid-2 — the training run starting from weight 2.0 and bias 2.0]

Although this example uses the same learning rate (\eta =0.15), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don’t change much at all. Then the learning kicks in and, much as in our first example, the neuron’s output rapidly moves closer to 0.0.

This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we’re badly wrong about something. But we’ve just seen that our artificial neuron has a lot of difficulty learning when it’s badly wrong – far more difficulty than when it’s just a little wrong. What’s more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?

To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, \partial C/\partial w and \partial C / \partial b. So saying “learning is slow” is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let’s compute the partial derivatives. Recall that we’re using the quadratic cost function, which, from Equation (6), is given by

C = \frac{(y-a)^2}{2}, \ \ \ \ (54)

where a is the neuron’s output when the training input x = 1 is used, and y = 0 is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that a = \sigma(z), where z = wx+b. Using the chain rule to differentiate with respect to the weight and bias we get

\frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z) \ \ \ \ (55)
\frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z), \ \ \ \ (56)

where I have substituted x = 1 and y = 0. To understand the behaviour of these expressions, let’s look more closely at the \sigma'(z) term on the right-hand side. Recall the shape of the \sigma function:

[Figure: Sigmoid-3 — the shape of the sigmoid function \sigma(z)]

We can see from this graph that when the neuron’s output is close to 1, the curve gets very flat, and so \sigma'(z) gets very small. Equations (55) and (56) then tell us that \partial C/\partial w and \partial C / \partial b get very small. This is the origin of the learning slowdown. What’s more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we’ve been playing with.

───
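For readers without the interactive widget, the slowdown is easy to reproduce offline. The following sketch is my own reconstruction under the assumptions quoted above — one input x = 1, target y = 0, quadratic cost C = (y-a)^2/2, learning rate \eta = 0.15 — while the 300-epoch budget and the printing schedule are arbitrary choices made just to expose the effect.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_quadratic(w, b, eta=0.15, epochs=300):
    # Plain gradient descent on C = (y - a)^2 / 2 for x = 1, y = 0,
    # using equations (55) and (56): dC/dw = dC/db = (a - y) * sigma'(z).
    x, y = 1.0, 0.0
    history = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        grad = (a - y) * a * (1.0 - a)   # (a - y) * sigma'(z), since sigma'(z) = a(1 - a)
        w -= eta * grad * x
        b -= eta * grad
        history.append(a)
    return history

for w0, b0 in [(0.6, 0.9), (2.0, 2.0)]:
    outs = train_quadratic(w0, b0)
    print("start (w, b) =", (w0, b0),
          " outputs after epochs 1, 100, 200, 300:",
          [round(outs[i], 3) for i in (0, 99, 199, 299)])

Starting from (0.6, 0.9) the output should fall away steadily, while from (2.0, 2.0) it should barely move at first, in line with the quoted description: \sigma'(z) stays tiny while the output sits saturated near 1.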

 

The saying ‘greatness shows in the ordinary; real skill lies in the fine details’ is the hallmark of those who can ‘explain the profound in plain terms’, which is why he can pick up so ‘trivial’ a case and treat it so casually!!??

That the ‘Sigmoid’ was the one chosen — being so richly endowed with useful properties and tied by a thousand threads to the ‘Perceptrons’ — is surely why it stands out in writing of every kind on ‘neural networks’??!!

Yet people often wish at once that ‘the horse run fast’ and that ‘the horse eat no grass’: so the high ‘stability’ of the ‘sigmoid neuron’ turns, perversely, into ‘slowness to change’??? But were it not for the ‘flat ends’ of that ‘S-curve’, would ‘learning’ really be better off!!! Not to mention that people also want to rule out the case where a tiny ‘difference between samples’ is enough to flip a ‘sigmoid neuron’s output from ‘correct’ to ‘wrong’!!??

For example, consider the

Threshold voltage

The threshold voltage[1], also called the threshold or turn-on voltage[2], usually refers to the input voltage corresponding to the midpoint of the transition region of the transfer characteristic curve (the plot of output voltage against input voltage) of a TTL gate or a MOSFET.

As the device passes from depletion into inversion, it goes through a state in which the electron concentration at the Si surface equals the hole concentration. At that point the device is at the edge of conduction, and the gate voltage there is defined as the threshold voltage, one of the key parameters of a MOSFET.

[Figure: Threshold_formation_nowatermark]

Computer simulation of the formation of the inversion channel (the change in electron density) in a nanowire MOSFET. The threshold voltage is around 0.45 V.

───

 

just quoted — how exactly should its ‘high or low’ and its ‘good or bad’ be decided??!!

Moreover, for ‘neural networks’, how will people ever make the case for how important ‘forgetting’ is to ‘learning’???

Finally, let me add just one figure, hoping it makes everything plain at a glance!!!

import numpy as np
import matplotlib.pyplot as plt

N = 100
I = np.arange(-10, 10, 1.0 / N)          # values of z

# Top panel: the sigmoid function itself.
Sigmoid = 1 / (1 + np.exp(-1 * I))
plt.subplot(3, 1, 1)
plt.plot(I, Sigmoid, 'k-')
plt.xlabel('z')
plt.ylabel('sigmoid output')

# Middle panel: its derivative, sigma'(z) = sigma(z) (1 - sigma(z)),
# which flattens out at both ends of the curve.
SigmoidPrime = Sigmoid * (1 - Sigmoid)
plt.subplot(3, 1, 2)
plt.plot(I, SigmoidPrime, 'r-')
plt.xlabel('z')
plt.ylabel("sigmoid' output")

# Bottom panel: sigma(z) * sigma'(z), i.e. the gradient a * sigma'(z) of the
# quadratic cost in the toy example (equations (55) and (56) with x = 1, y = 0).
Cost = Sigmoid * SigmoidPrime
plt.subplot(3, 1, 3)
plt.plot(I, Cost, 'b-')
plt.xlabel('z')
plt.ylabel("Cost output")

plt.show()

 

[Figure: Sigmoid-4 — the sigmoid, its derivative, and their product, plotted against z]