W!o+'s 《小伶鼬工坊演義》: Neural Networks 【學而堯曰】 Seven

'Statistical mechanics' and 'information theory' were bound, in the end, to meet at the concept of 'probability', with its 'randomness' and 'uncertainty'! The principle of 'maximum entropy' shoots straight for the heavens:

Principle of maximum entropy

The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy.

Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. Of those, the one with maximal information entropy is the proper distribution, according to this principle.
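As a quick illustration of the principle just stated (a minimal sketch, not part of the quoted article), consider Jaynes' well-known dice example: the only testable information is that a six-sided die averages 4.5 rather than 3.5. Among all distributions consistent with that constraint, the maximum entropy one has the exponential (Gibbs) form p_i ∝ exp(λ x_i), and a few lines of Python suffice to find it numerically. The numbers and variable names below are illustrative assumptions.

```python
import numpy as np

# Maximum entropy sketch (illustrative): a die with constrained mean 4.5.
# The maxent solution has the Gibbs form p_i ∝ exp(lam * x_i), with the
# Lagrange multiplier lam chosen so that the mean constraint is met.

x = np.arange(1, 7)           # faces 1..6
target_mean = 4.5             # the "testable information"

def mean_for(lam):
    """Mean of the Gibbs distribution p_i ∝ exp(lam * x_i)."""
    w = np.exp(lam * x)
    p = w / w.sum()
    return p @ x

# Bisection on lam: mean_for is monotone increasing in lam.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = (lo + hi) / 2
p = np.exp(lam * x)
p /= p.sum()
entropy = -(p * np.log(p)).sum()
print("maxent distribution:", np.round(p, 4))
print("mean:", round(p @ x, 4), "entropy:", round(entropy, 4))
```

The resulting distribution tilts smoothly toward the higher faces while committing to nothing beyond the stated mean.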

History

The principle was first expounded by E. T. Jaynes in two papers in 1957[1][2] where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes offered a new and very general rationale why the Gibbsian method of statistical mechanics works. He argued that the entropy of statistical mechanics and the information entropy of information theory are principally the same thing. Consequently, statistical mechanics should be seen just as a particular application of a general tool of logical inference and information theory.

……

The Wallis derivation

The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962.[8] It is essentially the same mathematical argument used for the Maxwell–Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of ‘uncertainty’, ‘uninformativeness’, or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make a probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures). If it is inconsistent, she will reject it and try again. If it is consistent, her assessment will be

p_i = \frac{n_i}{N}

where p_i is the probability of the i-th proposition, while n_i is the number of quanta that were assigned to the i-th proposition (i.e. the number of balls that ended up in bucket i).
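A rough simulation of the random experiment just described may help fix ideas. This is an illustrative sketch under assumed values of m and N, not something from the quoted text.

```python
import numpy as np

# Wallis thought experiment (illustrative): N quanta of probability, each
# worth 1/N, are thrown "blindfolded" into m equally likely buckets; the
# resulting assignment is p_i = n_i / N.

rng = np.random.default_rng(0)
m, N = 4, 10_000                                               # assumed sizes

counts = np.bincount(rng.integers(0, m, size=N), minlength=m)  # n_i
p = counts / N                                                 # p_i = n_i / N
print("n_i:", counts)
print("p_i:", p)
```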

Now, in order to reduce the ‘graininess’ of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,

Pr(\mathbf{p}) = W \cdot m^{-N}

where

W = \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!}

is sometimes known as the multiplicity of the outcome.
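For concreteness, the multiplicity W and the multinomial probability Pr(p) can be evaluated in log space, using log N! = gammaln(N + 1) to avoid overflow for large N. The counts below are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

# Illustrative sketch: W = N! / (n_1! ... n_m!) and Pr(p) = W * m**(-N),
# computed via log-factorials so large N does not overflow.

def log_multiplicity(counts):
    counts = np.asarray(counts)
    N = counts.sum()
    return gammaln(N + 1) - gammaln(counts + 1).sum()

counts = np.array([3, 5, 2])          # hypothetical n_i with N = 10, m = 3
N, m = counts.sum(), len(counts)
logW = log_multiplicity(counts)
print("W  =", np.exp(logW))           # 2520 for this example
print("Pr =", np.exp(logW - N * np.log(m)))
```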

The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, the protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize

\begin{array}{rcl} \frac{1}{N}\log W &=& \frac{1}{N}\log \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!} \\ \\ &=& \frac{1}{N}\log \frac{N!}{(Np_1)! \, (Np_2)! \, \dotsb \, (Np_m)!} \\ \\ &=& \frac{1}{N}\left( \log N! - \sum_{i=1}^m \log ((Np_i)!) \right). \end{array}

At this point, in order to simplify the expression, the protagonist takes the limit as N\to\infty, i.e. as the probability levels go from grainy discrete values to smooth continuous values. Using Stirling’s approximation, she finds

\begin{array}{rcl} \lim_{N \to \infty}\left(\frac{1}{N}\log W\right) &=& \frac{1}{N}\left( N\log N - \sum_{i=1}^m Np_i\log (Np_i) \right) \\ \\ &=& \log N - \sum_{i=1}^m p_i\log (Np_i) \\ \\ &=& \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \\ \\ &=& \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \\ \\ &=& - \sum_{i=1}^m p_i\log p_i \\ \\ &=& H(\mathbf{p}). \end{array}

All that remains for the protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all “fair” random distributions, in the limit as the probability levels go from discrete to continuous.
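A small numerical check of this limit (again a sketch, not part of the quoted derivation): holding p fixed and letting N grow, (1/N) log W indeed creeps up to H(p).

```python
import numpy as np
from scipy.special import gammaln

# Numerical check: with n_i = N * p_i, (1/N) * log W approaches
# H(p) = -sum_i p_i log p_i as N grows (Stirling's approximation at work).

p = np.array([0.5, 0.3, 0.2])
H = -(p * np.log(p)).sum()

for N in (10, 100, 1_000, 10_000, 100_000):
    counts = N * p                                     # treat N*p_i as counts
    logW = gammaln(N + 1) - gammaln(counts + 1).sum()  # log of N!/(prod n_i!)
    print(f"N = {N:>6}:  (1/N) log W = {logW / N:.5f}   H(p) = {H:.5f}")
```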

───

 

This man, Jaynes,

Edwin Thompson Jaynes

Edwin Thompson Jaynes (July 5, 1922 – April 30,[1] 1998) was the Wayman Crow Distinguished Professor of Physics at Washington University in St. Louis. He wrote extensively on statistical mechanics and on foundations of probability and statistical inference, initiating in 1957 the MaxEnt interpretation of thermodynamics,[2][3] as being a particular application of more general Bayesian/information theory techniques (although he argued this was already implicit in the works of Gibbs). Jaynes strongly promoted the interpretation of probability theory as an extension of logic.

───

 

systematized Pólya's notion of 'plausible' reasoning,

[Photo: George Pólya, ca. 1973]

George Pólya

How to Solve It

The book suggests the following steps when solving a mathematical problem:

1. First, you have to understand the problem.
2. After understanding, then make a plan.
3. Carry out the plan.
4. Look back on your work. How could it be better?

If this technique fails, Pólya advises: “If you can’t solve a problem, then there is an easier problem you can solve: find it.” Or: “If you cannot solve the proposed problem, try to solve first some related problem. Could you imagine a more accessible related problem?”

George Pólya taught mathematics for many years, studied the general patterns of mathematical thinking in depth, and promoted mathematics education throughout his life. In 1954 Pólya wrote two rather unusual volumes, 《Induction and Analogy in Mathematics》 and 《Patterns of Plausible Inference》, exploring 'heuristic' modes of thought, which are often an entry point for 'mathematical discovery' and a search for the roots of the 'reasonableness' behind everyday 'common-sense' judgements. To give an example, consider the typical Aristotelian 'syllogism':

P \Longrightarrow Q
P is true, \therefore Q is true.

Compare this with 'plausible' (Plausible) reasoning:

P \Longrightarrow Q
Q is true, so P is more likely to be true.

This kind of 'reasoning' is usually called the 'logical fallacy' of 'affirming the consequent' Q, because in 'logic' an inference of this 'form' does not 'necessarily' guarantee that the 'conclusion' is 'true'. But is this 'form of reasoning' therefore entirely without 'merit'? By the 'logic' of the 'syllogism', if Q were 'false' then P would 'necessarily' be 'false' as well. So if Q, a 'necessary condition' for P being 'true', turns out to be 'true', should P not then be 'more likely' to be 'true'??

─── Excerpted from 《物理哲學·下中》……
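A tiny Bayes'-rule calculation makes the point of the excerpt concrete (illustrative numbers only, not from the text): when P implies Q, observing that Q is true can only raise the probability of P.

```python
# If P implies Q, then Pr(Q | P) = 1, and Bayes' rule gives
#     Pr(P | Q) = Pr(Q | P) * Pr(P) / Pr(Q) = Pr(P) / Pr(Q) >= Pr(P),
# with strict increase whenever Pr(Q) < 1. So learning Q really does make
# P "more plausible", even though the inference is not deductive.

prior_P = 0.2          # hypothetical prior for P
P_Q_given_P = 1.0      # P implies Q
P_Q_given_notP = 0.4   # Q can also happen without P (hypothetical)

P_Q = P_Q_given_P * prior_P + P_Q_given_notP * (1 - prior_P)
posterior_P = P_Q_given_P * prior_P / P_Q

print(f"Pr(P) = {prior_P:.3f}  ->  Pr(P | Q) = {posterior_P:.3f}")
# Pr(P | Q) ≈ 0.385 > 0.2: Q being true makes P more probable.
```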

 

and brought 'probability theory' into the hall of logic,

Probability Theory: The Logic Of Science

The material available from this page is a pdf version of E. T. Jaynes’s book.

Introduction

Please note that the contents of the file from the link below are slightly out of sync with the actual contents of the book. The listing on this page corresponds to the existing chapter order and names.

……

[PT-1] [PT-2]

───

 

Truly no accident!!?? If everyone could 'work by the rules of reason' in this way, perhaps the 'gambler's fallacy' could be avoided??!!

Gambler’s fallacy

The gambler’s fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future, or that, if something happens less frequently than normal during some period, it will happen more frequently in the future (presumably as a means of balancing nature). In situations where what is being observed is truly random (i.e., independent trials of a random process), this belief, though appealing to the human mind, is false. This fallacy can arise in many practical situations although it is most strongly associated with gambling where such mistakes are common among players.

The use of the term Monte Carlo fallacy originates from the most famous example of this phenomenon, which occurred in the Monte Carlo Casino in 1913.[1][2]

……

Examples

Coin toss

The gambler’s fallacy can be illustrated by considering the repeated toss of a fair coin. With a fair coin, the outcomes in different tosses are statistically independent and the probability of getting heads on a single toss is exactly 1/2 (one in two). It follows that the probability of getting two heads in two tosses is 1/4 (one in four) and the probability of getting three heads in three tosses is 1/8 (one in eight). In general, if we let A_i be the event that toss i of a fair coin comes up heads, then we have,

\Pr\left(\bigcap_{i=1}^n A_i\right)=\prod_{i=1}^n \Pr(A_i)={1\over2^n}.

Now suppose that we have just tossed four heads in a row, so that if the next coin toss were also to come up heads, it would complete a run of five successive heads. Since the probability of a run of five successive heads is only 1/32 (one in thirty-two), a person subject to the gambler’s fallacy might believe that this next flip is less likely to be heads than tails. However, this is not correct, and is a manifestation of the gambler’s fallacy; the event of 5 heads in a row and the event of “first 4 heads, then a tail” are equally likely, each having probability 1/32. Given that the first four tosses turn up heads, the probability that the next toss is a head is in fact,

\Pr\left(A_5|A_1 \cap A_2 \cap A_3 \cap A_4 \right)=\Pr\left(A_5\right)=\frac{1}{2}.

While the probability of a run of five heads is only 1/32 = 0.03125, that is the probability only before the first coin is tossed. After the first four tosses, the results are no longer unknown, so their probabilities are 1. The fallacy is the reasoning that the next toss is more likely to be a tail than a head because of the past tosses, as if a run of luck in the past somehow influenced the odds in the future.
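A quick simulation (not part of the quoted article) confirms the calculation: among fair-coin runs that begin with four heads, the fifth toss still comes up heads about half the time.

```python
import numpy as np

# Simulate many five-toss runs of a fair coin and condition on the first
# four tosses being heads; the fifth toss remains close to 50-50.

rng = np.random.default_rng(1)
tosses = rng.integers(0, 2, size=(1_000_000, 5))   # 1 = heads, 0 = tails

four_heads = tosses[:, :4].sum(axis=1) == 4        # runs that begin HHHH
fifth_is_head = tosses[four_heads, 4].mean()

print("fraction of runs starting HHHH:", four_heads.mean())   # ≈ 1/16
print("Pr(5th is heads | first 4 heads):", fifth_is_head)     # ≈ 0.5
```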

 


Simulation of coin tosses: Each frame, a coin is flipped which is red on one side and blue on the other. The result of each flip is added as a colored dot in the corresponding column. As the pie chart shows, the proportion of red versus blue approaches 50-50 (the law of large numbers). But the difference between red and blue does not systematically decrease to zero.
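The caption's last remark is worth checking numerically (an illustrative sketch): the running fraction of heads converges to 1/2, yet the absolute gap between heads and tails keeps wandering and tends to grow roughly like the square root of n.

```python
import numpy as np

# Law of large numbers vs. the absolute head/tail gap: the *proportion* of
# heads approaches 1/2, but |heads - tails| does not systematically shrink.

rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=1_000_000) * 2 - 1   # +1 = heads, -1 = tails
running_diff = np.cumsum(flips)                      # (#heads - #tails) so far
n = np.arange(1, flips.size + 1)
proportion = (running_diff / n + 1) / 2              # running fraction of heads

for k in (100, 10_000, 1_000_000):
    print(f"n = {k:>9}: heads fraction = {proportion[k-1]:.4f}, "
          f"|heads - tails| = {abs(running_diff[k-1])}")
```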

───