W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】六

《Thirty-Six Stratagems》, attributed to Tan Daoji of the Northern and Southern Dynasties

Deceive the Heavens to Cross the Sea

"When preparations are thorough, vigilance slackens; what is seen every day raises no suspicion. The hidden lies within the open, not apart from it. The utmost yang conceals the utmost yin."

In the seventeenth year of the Zhenguan era, Emperor Taizong of Tang led an army of three hundred thousand on an eastern campaign. The emperor suffered from seasickness, and Xue Rengui, fearing that he would not dare to cross the sea and would turn the army back, disguised himself as a wealthy commoner, paid his respects to Taizong, and invited the emperor and his civil and military officials to be guests at his house. The house was hung all around with embroidered curtains and colored brocade, a splendid sight, and Taizong and his officials drank and made merry there. Before long the room began to sway and the wine cups fell to the floor. Startled, Taizong and his company drew back the curtains and brocade, only to find that they and the three hundred thousand troops were already out at sea. Since in ancient times the emperor styled himself the Son of Heaven, the "heaven" that was deceived into crossing the sea refers to the emperor, and the stratagem is therefore called "deceive the heavens to cross the sea."

 

The art of war is a matter of 'yin and yang': one waits for the moment to gather 'news', and events yield 'intelligence'. What is commonly seen has a high 'frequency of occurrence'; being so ordinary, it arouses no suspicion and is judged to have little 'information value'! If someone could build an 'information theory' out of this very observation, would that not itself be information of an extremely high 'bit' value?

Claude Shannon

Claude Elwood Shannon (April 30, 1916 – February 26, 2001) was an American mathematician, electrical engineer, and cryptographer, celebrated as the founder of information theory.[1][2] Shannon took his bachelor's degree at the University of Michigan and his doctorate at MIT.

In 1948 Shannon published the epoch-making paper "A Mathematical Theory of Communication", which laid the foundation of modern information theory. Beyond that, Shannon is also regarded as a founder of digital computer theory and of digital circuit design. In 1937, as a 21-year-old master's student at MIT, he proposed in his thesis that applying Boolean algebra to electronic circuits could construct and resolve any logical or numerical relationship; it has been praised as one of the most accomplished master's theses ever written.[3] During the Second World War, Shannon made major contributions to military cryptanalysis, both codebreaking and secure communication.

───

 

Unfortunately this 'information yardstick' comes wrapped in a pile of baffling 'terminology' (transmitter, channel, receiver, noise source, entropy, expected value, probability, information content, ...), so that the Wikipedia entry reads like a 'book from heaven', does it not??

Entropy (information theory)

In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. ‘Messages’ can be modeled by any flow of information.

In a more technical sense, there are reasons (explained below) to define information as the negative of the logarithm of the probability distribution. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit.

The logarithm of the probability distribution is useful as a measure of entropy because it is additive for independent sources. For instance, the entropy of a coin toss is 1 shannon, whereas of m tosses it is m shannons. Generally, you need log2(n) bits to represent a variable that can take one of n values if n is a power of 2. If these values are equally probable, the entropy (in shannons) is equal to the number of bits. Equality between number of bits and shannons holds only while all outcomes are equally probable. If one of the events is more probable than others, observation of that event is less informative. Conversely, rarer events provide more information when observed. Since observation of less probable events occurs more rarely, the net effect is that the entropy (thought of as average information) received from non-uniformly distributed data is less than log2(n). Entropy is zero when one outcome is certain. Shannon entropy quantifies all these considerations exactly when a probability distribution of the source is known. The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.

Generally, entropy refers to disorder or uncertainty. Shannon entropy was introduced by Claude E. Shannon in his 1948 paper “A Mathematical Theory of Communication“.[1] Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of an information source. Rényi entropy generalizes Shannon entropy.

Definition

Named after Boltzmann’s Η-theorem, Shannon defined the entropy Η (Greek letter Eta) of a discrete random variable X with possible values {x1, …, xn} and probability mass function P(X) as:

\Eta(X) = \mathrm{E}[\mathrm{I}(X)] = \mathrm{E}[-\ln(\mathrm{P}(X))].

Here E is the expected value operator, and I is the information content of X.[4][5] I(X) is itself a random variable.

The entropy can explicitly be written as

\Eta(X) = \sum_{i=1}^n {\mathrm{P}(x_i)\,\mathrm{I}(x_i)} = -\sum_{i=1}^n {\mathrm{P}(x_i) \log_b \mathrm{P}(x_i)},

where b is the base of the logarithm used. Common values of b are 2, Euler’s number e, and 10, and the unit of entropy is shannon for b = 2, nat for b = e, and hartley for b = 10.[6] When b = 2, the units of entropy are also commonly referred to as bits.

In the case of p(xi) = 0 for some i, the value of the corresponding summand 0 logb(0) is taken to be 0, which is consistent with the limit:

\lim_{p\to0+}p\log (p) = 0.
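To make the definition concrete, here is a minimal Python sketch (the function name `shannon_entropy` and the example distributions are my own, not from the article) that computes the entropy of a probability mass function, applies the 0 log_b(0) = 0 convention, and lets the base select the unit:

```python
import math

def shannon_entropy(probs, base=2):
    """Entropy of a discrete distribution, with 0*log(0) taken as 0.

    base=2 gives shannons (bits), base=math.e gives nats, base=10 gives hartleys.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin has 1 bit of entropy; a biased coin has less.
print(shannon_entropy([0.5, 0.5]))          # 1.0 shannon
print(shannon_entropy([0.9, 0.1]))          # ~0.469 shannons
print(shannon_entropy([0.5, 0.5], math.e))  # ~0.693 nats
```

The `if p > 0` filter is exactly the stated convention that vanishing probabilities contribute nothing to the sum.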

When the distribution is continuous rather than discrete, the sum is replaced with an integral as

\Eta(X) = \int {\mathrm{P}(x)\,\mathrm{I}(x)} ~dx = -\int {\mathrm{P}(x) \log_b \mathrm{P}(x)} ~dx,

where P(x) represents a probability density function.

One may also define the conditional entropy of two events X and Y taking values xi and yj respectively, as

 \Eta(X|Y)=\sum_{i,j}p(x_{i},y_{j})\log\frac{p(y_{j})}{p(x_{i},y_{j})}

where p(xi, yj) is the probability that X = xi and Y = yj. This quantity should be understood as the amount of randomness in the random variable X given the event Y.
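As a quick numerical illustration of this last formula, the sketch below (a toy joint distribution and helper name of my own choosing) computes the conditional entropy directly from the joint probabilities:

```python
import math

def conditional_entropy(joint, base=2):
    """H(X|Y) = sum_{i,j} p(x_i, y_j) * log( p(y_j) / p(x_i, y_j) ).

    `joint[i][j]` is p(X = x_i, Y = y_j); rows index X, columns index Y.
    """
    p_y = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    return sum(p_xy * math.log(p_y[j] / p_xy, base)
               for row in joint
               for j, p_xy in enumerate(row)
               if p_xy > 0)

# A toy joint distribution: two dependent binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(conditional_entropy(joint))  # ~0.722 bits of randomness left in X once Y is known
```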

───

 

If one tries to read Shannon's great paper of 1948 for oneself,

A Mathematical Theory of Communication

Shannon

 

the puzzlement may well be resolved. The most important point of all is to understand clearly what the 'information function' really is???

Rationale

To understand the meaning of pi log(1/pi), at first, try to define an information function, I, in terms of an event i with probability pi. How much information is acquired due to the observation of event i? Shannon’s solution follows from the fundamental properties of information:[7]

  1. I(p) ≥ 0 – information is a non-negative quantity
  2. I(1) = 0 – events that always occur do not communicate information
  3. I(p1 p2) = I(p1) + I(p2) – information due to independent events is additive

The last is a crucial property. It states that joint probability communicates as much information as two individual events separately. Particularly, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes then there are mn possible outcomes of the joint event. This means that if log2(n) bits are needed to encode the first value and log2(m) to encode the second, one needs log2(mn) = log2(m) + log2(n) to encode both. Shannon discovered that the proper choice of function to quantify information, preserving this additivity, is logarithmic, i.e.,

\mathrm{I}(p) = \log(1/p)

The base of the logarithm can be any fixed real number greater than 1. The different units of information (bits for log2, trits for log3, nats for the natural logarithm ln and so on) are just constant multiples of each other. (In contrast, the entropy would be negative if the base of the logarithm were less than 1.) For instance, in case of a fair coin toss, heads provides log2(2) = 1 bit of information, which is approximately 0.693 nats or 0.631 trits. Because of additivity, n tosses provide n bits of information, which is approximately 0.693n nats or 0.631n trits.
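A small Python sketch of this information function and its units (the helper name `information` is mine), reproducing the fair-coin numbers quoted above and checking additivity:

```python
import math

def information(p, base=2):
    """Self-information I(p) = log(1/p); base 2 -> bits, e -> nats, 3 -> trits."""
    return math.log(1.0 / p, base)

p_heads = 0.5
print(information(p_heads, 2))        # 1.0 bit
print(information(p_heads, math.e))   # ~0.693 nats
print(information(p_heads, 3))        # ~0.631 trits

# Additivity for independent events: I(p1*p2) == I(p1) + I(p2)
p1, p2 = 0.5, 0.25
print(math.isclose(information(p1 * p2), information(p1) + information(p2)))  # True
```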

Now, suppose we have a distribution where event i can happen with probability pi. Suppose we have sampled it N times and outcome i was, accordingly, seen ni = N pi times. The total amount of information we have received is

\sum_i {n_i \mathrm{I}(p_i)} = \sum {N p_i \log(1/p_i)}.

The average amount of information that we receive with every event is therefore

\sum_i {p_i \log {1\over p_i}}.
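The same averaging argument can also be checked empirically: sample the distribution N times, add up I(p_i) for each observed outcome, and divide by N; the average approaches the entropy. The distribution and sample size below are arbitrary choices for illustration:

```python
import math
import random

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]      # an arbitrary example distribution
N = 100_000

random.seed(0)
outcomes = random.choices(range(len(probs)), weights=probs, k=N)
total_information = sum(math.log2(1.0 / probs[i]) for i in outcomes)

print(total_information / N)  # empirical average information per event, ~1.75
print(entropy(probs))         # exact entropy: 1.75 bits
```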

───

 

It is as if 'event probability' had been taken as fundamental from the very start!! Could it be that the 'rarer' a joint event p_1 \times p_2 \times \cdots, the higher its 'information' content I(p_1) + I(p_2) + \cdots = \log\frac{1}{p_1} + \log\frac{1}{p_2} + \cdots?? The reader should not find this strange; the case is not without parallel, and perhaps that whole generation was fond of arguing this way!!?? 'Von Neumann's' 'rational thermometer', for instance, was likewise developed by the 'axiomatic method':

When Bernoulli put forward a theory to explain the 'St. Petersburg paradox', he opened the door to 'Utility':

【Diminishing marginal utility】: For 'wealth', more is always better, which is to say the first derivative of the 'utility function' U(w) is positive, \frac{dU(w)}{dw} > 0; but as 'wealth' increases, the rate at which 'satisfaction' accumulates keeps falling, because the second derivative of the 'utility function' is negative, \frac{d^2U(w)}{dw^2} < 0.

【Maximum utility】: Under conditions of 'risk' and 'uncertainty', the 'criterion' of a person's 'rational decision' is to maximize the 'expected utility', not the 'expected monetary' amount.

The Oxford dictionary 'defines' 'Utility' as:

The state of being useful, profitable, or beneficial:
(In game theory or economics) a measure of that which is sought to be maximized in any situation involving a choice.

Whatever kind of 'preference' the word 'utility' stands for (usefulness, profit, satisfaction, pleasure, happiness), it involves subjective feeling. Can a 'scale' really be fixed for it? Does a 'utility function' really 'exist'??

[Image] A thermometer: it measures hot and cold.

[Image] A Luban ruler (魯班尺): it gauges fortune and misfortune.

In 1947 John von Neumann, the Hungarian-born Jewish-American mathematician and one of the founders of the modern computer, and the German-American economist Oskar Morgenstern showed that as long as an 'individual's' 'measure' of 'preference' satisfies 'four axioms', that individual's 'utility function' 'exists'; moreover, apart from fixing the 'zero point' and 'defining' the 'unit of distance', this 'utility function' is essentially 'unique'. It is like a 'rational' 'thermometer' the individual carries about, which for any 'choice' reports the greatest 'satisfaction' and 'expected value'. Today this is called 'Expected Utility Theory'.

Because everyone 'feels hot and cold' differently, the 'marks' on a 'thermometer' do not stand for ordinary mathematical 'numbers'. On a comparative 'scale' of this kind only the 'differences' carry a relative sense of 'more or less'; 'ratios of values' mean nothing. Twenty degrees Celsius, for instance, is not twice as hot as ten degrees Celsius. In measurement such a 'scale' is called an 'interval scale'.

A 'thermometer' measures whether the 'temperature' is 'high or low'; the 'rational' 'thermometer' measures whether a 'choice' is 'good or bad'. The method most widely adopted in 'experimental economics' is the 'lottery-choice experiment': you are asked to pick, out of 'many lotteries', the 'lottery' you 'prefer'.

A 'lottery' L that yields mutually exclusive 'outcomes' A_i with various 'probabilities' p_i can then be written as:

L = \sum \limits_{i=1}^{N} p_i A_i ,  \  \sum \limits_{i=1}^{N} p_i  =1,  \ i=1 \cdots N

The 'four axioms' of 'expected utility theory' can then be stated as follows:

【Axiom of Completeness】

L\prec M, M\prec L, or L \sim M

Any two 'lotteries' can be compared in 'preference', and the comparison must come out as exactly one of the three relations above: 'M is preferred', L\prec M; 'L is preferred', M\prec L; or 'indifference', L \sim M.

【Axiom of Transitivity】

If L \preceq M and M \preceq N, then L \preceq N.

【Axiom of Continuity】

If L \preceq M \preceq N, then there exists a 'probability' p\in[0,1] such that pL + (1-p)N \sim M.

【Axiom of Independence】

If L\prec M, then for any 'lottery' N and any 'probability' p\in(0,1], pL+(1-p)N \prec pM+(1-p)N.

For any 'rational agent' satisfying the axioms above, one can 'construct' a 'utility function' u with A_i \rightarrow u(A_i) such that, for any two 'lotteries', L\prec M \Longleftrightarrow \  E(u(L)) < E(u(M)). Here E(u(L)) denotes the 'expected utility' of the 'lottery' L, written Eu(L) for short, and it satisfies

Eu(p_1 A_1 + \ldots + p_n A_n) = p_1 u(A_1) + \cdots + p_n u(A_n)

In 'microeconomics', 'game theory', and 'decision theory' this is today called the 'expected utility hypothesis': under 'risk', the 'rational choice' any 'individual' ought to make is the one that 'maximizes' 'expected utility'. If the 'decisions' of life could really be 'simplified' this way, perhaps the road to 'joy' and 'happiness' would be far clearer. Yet some held that the 'hypothesis' does not match how people actually reason. In 1952 the French economist Maurice Félix Charles Allais (winner of the 1988 Nobel Prize in Economics) ran a famous experiment to see how people actually 'make choices'; the 'lottery-choice experiment' Allais devised is the celebrated 'Allais paradox'.

─── excerpted from 《物理哲學·下中…
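As an aside, the rule Eu(p_1 A_1 + \ldots + p_n A_n) = p_1 u(A_1) + \cdots + p_n u(A_n) is easy to put into code. The sketch below is only an illustration under assumptions of my own: a hypothetical concave utility u(w) = \sqrt{w} (consistent with diminishing marginal utility) and two made-up lotteries with the same expected amount:

```python
import math

def expected_utility(lottery, u):
    """Eu(L) = sum_i p_i * u(A_i) for a lottery given as [(p_i, A_i), ...]."""
    return sum(p * u(a) for p, a in lottery)

u = math.sqrt                       # hypothetical concave utility: diminishing marginal utility

safe   = [(1.0, 100)]               # 100 for sure
gamble = [(0.5, 0), (0.5, 200)]     # same expected amount (100), but risky

print(expected_utility(safe, u))    # 10.0
print(expected_utility(gamble, u))  # ~7.07 -> the risk-averse agent prefers the sure thing
```

With a concave u the sure payment wins even though both lotteries have the same expected monetary value, which is exactly the point of maximizing expected utility rather than expected amount.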

 

Seen this way, if someone tosses n coins, the resulting event 0,1,1,0,0,0,1,1 \cdots (here 0 = heads, 1 = tails) carries I(2^{-n}) = \log_2(\frac{1}{2^{-n}} ) = n bits of information, which equals the result of n people each tossing one coin, n \times I(2^{-1}) = n \times \log_2(\frac{1}{2^{-1}} ) = n, that is, n pieces of one-bit information. How very beautiful.

[Figure] 2 shannons of entropy: Information entropy is the log-base-2 of the number of possible outcomes; with two coins there are four outcomes, and the entropy is two bits.
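A tiny sketch checking both counts: one specific outcome of n fair coins carries I(2^{-n}) = n bits, the same as n separate one-bit tosses, and two coins with four equiprobable outcomes give 2 shannons of entropy (the value of n and the helper name are my own):

```python
import math

def information_bits(p):
    return math.log2(1.0 / p)

n = 8
print(information_bits(2 ** -n))   # 8.0 bits for one specific n-coin outcome
print(n * information_bits(0.5))   # 8.0 bits from n independent one-coin tosses

# Two fair coins: four equally likely outcomes, entropy = log2(4) = 2 shannons.
outcomes = 4
print(sum((1 / outcomes) * math.log2(outcomes) for _ in range(outcomes)))  # 2.0
```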

 

If anyone wants to object to Shannon's conclusion, they will have to object to his 'assumptions'.

Arguing from the opposite side: on Shannon's view an almost certain event, p \approx 1, carries almost no information content, I(p) = \log_2(\frac{1}{p}) \approx 0! Its complement 1-p, the 'uncertainty' in the information, thereby naturally becomes a 'ruler' as well!!

Binary entropy function

In information theory, the binary entropy function, denoted \operatorname H(p) or \operatorname H_\text{b}(p), is defined as the entropy of a Bernoulli process with probability of success p. Mathematically, the Bernoulli trial is modelled as a random variable X that can take on only two values: 0 and 1. The event X = 1 is considered a success and the event X = 0 is considered a failure. (These two events are mutually exclusive and exhaustive.)

If \operatorname{Pr}(X=1) = p, then \operatorname{Pr}(X=0) = 1-p and the entropy of X (in shannons) is given by

\operatorname H(X) = \operatorname H_\text{b}(p) = -p \log_2 p - (1 - p) \log_2 (1 - p),

where 0 \log_2 0 is taken to be 0. The logarithms in this formula are usually taken (as shown in the graph) to the base 2. See binary logarithm.

When p=\tfrac 1 2, the binary entropy function attains its maximum value. This is the case of the unbiased bit, the most common unit of information entropy.

\operatorname H(p) is distinguished from the entropy function \operatorname H(X) in that the former takes a single real number as a parameter whereas the latter takes a distribution or random variables as a parameter. Sometimes the binary entropy function is also written as \operatorname H_2(p). However, it is different from and should not be confused with the Rényi entropy, which is denoted as \operatorname H_2(X).

[Figure] Entropy of a Bernoulli trial as a function of success probability, called the binary entropy function.

Explanation

In terms of information theory, entropy is considered to be a measure of the uncertainty in a message. To put it intuitively, suppose p=0. At this probability, the event is certain never to occur, and so there is no uncertainty at all, leading to an entropy of 0. If p=1, the result is again certain, so the entropy is 0 here as well. When p=1/2, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit. Intermediate values fall between these cases; for instance, if p=1/4, there is still a measure of uncertainty on the outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy, is less than 1 full bit.
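A short sketch of the binary entropy function at the probabilities discussed above (p = 0, 1/4, 1/2, 1); the function name is mine:

```python
import math

def binary_entropy(p):
    """H_b(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, binary_entropy(p))
# 0.0  -> 0.0 bits (certain failure)
# 0.25 -> ~0.811 bits (still predictable more often than not)
# 0.5  -> 1.0 bit  (maximum uncertainty, the unbiased bit)
# 1.0  -> 0.0 bits (certain success)
```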

───

 

In fact, starting from

Logarithm

Integral representation of the natural logarithm

The natural logarithm of t equals the integral of 1/x dx from 1 to t:

\ln (t) = \int_1^t \frac{1}{x} \, dx.

[Figure] The natural logarithm of t is the shaded area underneath the graph of the function f(x) = 1/x (reciprocal of x).

───

 

the definition \ln(t) = \int_1^t \frac{1}{x} dx quoted above, we can derive an important inequality: \ln(t) \leq t - 1.

【Proof】

‧ If t \geq 1, then \frac{1}{x} \leq 1 for every x \in [1, t], so

\int_1^t \frac{1}{x} dx \leq \int_1^t 1 dx = t -1

‧ If 0 < t \leq 1, then \frac{1}{x} \geq 1 for every x \in [t, 1], so

\int_t^1 \frac{1}{x} dx \geq \int_t^1 1 dx = 1-t, and therefore

-\int_t^1 \frac{1}{x} dx = \int_1^t \frac{1}{x} dx \leq t - 1
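For what it is worth, here is a quick numerical spot check of \ln(t) \leq t - 1 (an illustration only, not a substitute for the proof above; the sample points are arbitrary):

```python
import math

# ln(t) <= t - 1 for all t > 0, with equality only at t = 1.
for t in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(t, math.log(t), t - 1, math.log(t) <= t - 1)
# Every line ends with True; at t = 1.0 both sides are exactly 0.
```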

 

This inequality leads straight to the door of 'Gibbs' inequality':

Gibbs’ inequality

In information theory, Gibbs’ inequality is a statement about the mathematical entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs’ inequality, including Fano’s inequality. It was first presented by J. Willard Gibbs in the 19th century.

Gibbs’ inequality

Suppose that

 P = \{ p_1 , \ldots , p_n \}

is a probability distribution. Then for any other probability distribution

 Q = \{ q_1 , \ldots , q_n \}

the following inequality between positive quantities (since the pi and qi are positive numbers less than one) holds[1]:68

 - \sum_{i=1}^n p_i \log_2 p_i \leq - \sum_{i=1}^n p_i \log_2 q_i

with equality if and only if

 p_i = q_i \,

for all i. Put in words, the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]:34

 D_{\mathrm{KL}}(P\|Q) \equiv \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i} \geq 0.

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an “average surprisal” measured in bits.

Proof

Since

 \log_2 a = \frac{ \ln a }{ \ln 2 }

it is sufficient to prove the statement using the natural logarithm (ln). Note that the natural logarithm satisfies

 \ln x \leq x-1

for all x > 0 with equality if and only if x=1.

Let I denote the set of all i for which pi is non-zero. Then

- \sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq - \sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right)
= - \sum_{i \in I} q_i + \sum_{i \in I} p_i
 = - \sum_{i \in I} q_i + 1 \geq 0.

So

 - \sum_{i \in I} p_i \ln q_i \geq - \sum_{i \in I} p_i \ln p_i

and then trivially

 - \sum_{i=1}^n p_i \ln q_i \geq - \sum_{i=1}^n p_i \ln p_i

since the right hand side does not grow, but the left hand side may grow or may stay the same.

For equality to hold, we require:

  1.  \frac{q_i}{p_i} = 1 for all i \in I so that the approximation \ln \frac{q_i}{p_i} = \frac{q_i}{p_i} -1 is exact.
  2.  \sum_{i \in I} q_i = 1 so that equality continues to hold between the third and fourth lines of the proof.

This can happen if and only if

p_i = q_i

for i = 1, …, n.

───
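Before moving on, here is a minimal numerical check of Gibbs' inequality in its Kullback–Leibler form, using random distributions of my own choosing: the divergence stays non-negative and vanishes when P = Q.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i p_i * log2(p_i / q_i); Gibbs' inequality says this is >= 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n, rng):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
p = random_distribution(5, rng)
q = random_distribution(5, rng)

print(kl_divergence(p, q))  # some positive number
print(kl_divergence(p, p))  # 0.0: equality exactly when P == Q
```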

 

From this, taking Q to be the uniform distribution q_i = \frac{1}{n}, we obtain

- \sum \limits_{i=1}^{n} p_i \ln(p_i) \leq - \sum \limits_{i=1}^{n} p_i \ln(\frac{1}{n}) = \ln(n) \sum \limits_{i=1}^{n} p_i = \ln(n)

Is that not truly striking???
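A last sketch confirming this bound numerically (in nats, to match the \ln above): for random distributions the entropy never exceeds \ln(n), and the uniform distribution attains it. The value n = 6 is an arbitrary choice.

```python
import math
import random

def entropy_nats(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 6
rng = random.Random(0)

for _ in range(3):
    w = [rng.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    print(entropy_nats(p) <= math.log(n) + 1e-12)   # True for every random distribution

print(entropy_nats([1.0 / n] * n), math.log(n))     # both ~1.792: the uniform case attains ln(n)
```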