W!o+ 的《小伶鼬工坊演義》︰神經網絡【Softmax】一

無事不登三寶殿,無因何故寫文章? Michael Nielsen 先生行文筆法真不可逆料也︰

Softmax

In this chapter we’ll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We’re not actually going to use softmax layers in the remainder of the chapter, so if you’re in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it’s intrinsically interesting, and in part because we’ll use softmax layers in Chapter 6, in our discussion of deep neural networks.

The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j. (In describing the softmax we’ll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation.) However, we don’t apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the z^L_j. According to this function, the activation a^L_j of the jth output neuron is

a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \ \ \ \ (78)

where in the denominator we sum over all the output neurons.
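To make Equation (78) concrete, here is a minimal NumPy sketch of the softmax function; the subtraction of the maximum is a standard numerical-stability trick, and the example values are arbitrary:

import numpy as np

def softmax(z):
    """Softmax of the weighted inputs z^L_j, as in Equation (78)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # shift by the max for numerical stability
    return e / e.sum()           # a^L_j = e^{z^L_j} / sum_k e^{z^L_k}

z = np.array([1.0, 2.0, 3.0, 4.0])   # four weighted inputs z^L_1, ..., z^L_4
a = softmax(z)
print(a)          # roughly [0.0321 0.0871 0.2369 0.6439]
print(a.sum())    # 1.0, as Equation (79) below guarantees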

If you’re not familiar with the softmax function, Equation (78) may look pretty opaque. It’s certainly not obvious why we’d want to use this function. And it’s also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78), suppose we have a network with four output neurons, and four corresponding weighted inputs, which we’ll denote z^L_1, z^L_2, z^L_3, and z^L_4. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase z^L_4:

[Interactive figure: sliders for the weighted inputs z^L_1, …, z^L_4 and a bar chart of the corresponding softmax output activations]

As you increase z^L_4, you’ll see an increase in the corresponding output activation, a^L_4, and a decrease in the other output activations. Similarly, if you decrease z^L_4 then a^L_4 will decrease, and all the other output activations will increase. In fact, if you look closely, you’ll see that in both cases the total change in the other activations exactly compensates for the change in a^L_4. The reason is that the output activations are guaranteed to always sum up to 1, as we can prove using Equation (78) and a little algebra:

\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \ \ \ \ (79)

As a result, if a^L_4 increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains 1. And, of course, similar statements hold for all the other activations.

Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution.

The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it’s convenient to be able to interpret the output activation a^L_j as the network’s estimate of the probability that the correct output is j. So, for instance, in the MNIST classification problem, we can interpret a^L_j as the network’s estimated probability that the correct digit classification is j.

By contrast, if the output layer was a sigmoid layer, then we certainly couldn’t assume that the activations formed a probability distribution. I won’t explicitly prove it, but it should be plausible that the activations from a sigmoid layer won’t in general form a probability distribution. And so with a sigmoid output layer we don’t have such a simple interpretation of the output activations.

 

若說有說『Softmax』是什麼?僅只有『定義』而已!彷彿沒說??若說沒說?還有個『動態模擬』可以玩玩!!考之維基百科詞條︰

Softmax function

In mathematics, in particular probability theory and related fields, the softmax function, or normalized exponential,[1]:198 is a generalization of the logistic function that “squashes” a K-dimensional vector \mathbf{z} of arbitrary real values to a K-dimensional vector \sigma(\mathbf{z}) of real values in the range (0, 1) that add up to 1. The function is given by

\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}    for j = 1, …, K.

The softmax function is the gradient-log-normalizer of the categorical probability distribution. For this reason, the softmax function is used in various probabilistic multiclass classification methods including multinomial logistic regression,[1]:206–209 multiclass linear discriminant analysis, naive Bayes classifiers and artificial neural networks.[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the j‘th class given a sample vector x is:

P(y=j|\mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}

This can be seen as the composition of K linear functions \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K and the softmax function (where \mathbf{x}^\mathsf{T}\mathbf{w} denotes the inner product of \mathbf{x} and \mathbf{w}).
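As a small illustration of that composition, the sketch below feeds K made-up weight vectors and an arbitrary input through the K linear functions and then through softmax (none of these numbers come from the text):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# K = 3 classes, 2-dimensional input; the rows of W are the weight vectors w_1..w_K
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
x = np.array([0.5, 1.5])

scores = W @ x            # the K linear functions x -> x^T w_j
probs  = softmax(scores)  # P(y = j | x)
print(probs, probs.sum()) # class probabilities, summing to 1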

Artificial neural networks

In neural network simulations, the softmax function is often implemented at the final layer of a network used for classification. Such networks are then trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account:

 \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \dots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k))

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
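The Kronecker-delta formula can be spot-checked numerically. The sketch below builds the analytic Jacobian σ(q,i)(δ_ik − σ(q,k)) and compares it against central finite differences; the test vector and the tolerance are arbitrary choices:

import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def softmax_jacobian(q):
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)   # J[i, k] = s_i * (delta_ik - s_k)

q = np.array([0.5, -1.0, 2.0])
eps = 1e-6
numeric = np.zeros((3, 3))
for k in range(3):
    dq = np.zeros(3); dq[k] = eps
    numeric[:, k] = (softmax(q + dq) - softmax(q - dq)) / (2 * eps)

print(np.allclose(softmax_jacobian(q), numeric, atol=1e-6))   # True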

See Multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[3]

 P_t(a) = \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^n\exp(q_t(i)/\tau)} \text{,}

where the action value q_t(a) corresponds to the expected reward of following action a and \tau is called a temperature parameter (in allusion to statistical mechanics). For high temperatures (\tau\to \infty), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature (\tau\to 0^+), the probability of the action with the highest expected reward tends to 1.
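A small sketch of the temperature behaviour described above, with made-up action values q_t(a):

import numpy as np

def softmax_with_temperature(q, tau):
    """P_t(a) = exp(q(a)/tau) / sum_i exp(q(i)/tau)."""
    z = np.asarray(q, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([1.0, 2.0, 3.0])          # expected rewards for three actions
for tau in (100.0, 1.0, 0.01):
    print(tau, softmax_with_temperature(q, tau))
# high tau: nearly uniform probabilities; tau -> 0+: mass concentrates on the best action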

───

 

或可得部份習題解答,

Exercise

  • Construct an example showing explicitly that in a network with a sigmoid output layer, the output activations a^L_j won’t always sum to 1.

We’re starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we’re at: the exponentials in Equation (78) ensure that all the output activations are positive. And the sum in the denominator of Equation (78) ensures that the softmax outputs sum to 1. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the z^L_j, and then squishing them together to form a probability distribution.

Exercises

  • Monotonicity of softmax Show that \partial a^L_j / \partial z^L_k is positive if j = k and negative if j \neq k. As a consequence, increasing z^L_j is guaranteed to increase the corresponding output activation, a^L_j, and will decrease all the other output activations. We already saw this empirically with the sliders, but this is a rigorous proof.
  • Non-locality of softmax A nice thing about sigmoid layers is that the output a^L_j is a function of the corresponding weighted input, a^L_j = \sigma(z^L_j). Explain why this is not the case for a softmax layer: any particular output activation a^L_j depends on all the weighted inputs.

Problem

  • Inverting the softmax layer Suppose we have a neural network with a softmax output layer, and the activations a^L_j are known. Show that the corresponding weighted inputs have the form z^L_j = \ln a^L_j + C, for some constant C that is independent of j.
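Not a proof, but a quick numerical experiment touching both the monotonicity exercise and the inversion problem above (the weighted inputs are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 0.7, 2.0])
a = softmax(z)

# Monotonicity: nudging z_j upward raises a_j and lowers every other a_k
j, eps = 1, 1e-3
z2 = z.copy(); z2[j] += eps
a2 = softmax(z2)
print(a2[j] > a[j], np.all(np.delete(a2, j) < np.delete(a, j)))   # True True

# Inverting the softmax layer: z_j - ln a_j is the same constant C for every j
print(z - np.log(a))   # all entries equal, up to rounding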

 

或當知它與『配分函數』淵源匪淺︰

配分函數(英語:Partition function)是一個平衡態統計物理學中經常應用到的概念,經由計算配分函數可以將微觀物理狀態與宏觀物理量相互聯繫起來,而配分函數等價於自由能,與路徑積分在數學上有巧妙的類似。

Partition function (statistical mechanics)

In physics, a partition function describes the statistical properties of a system in thermodynamic equilibrium. Partition functions are functions of the thermodynamic state variables, such as the temperature and volume. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy, and pressure, can be expressed in terms of the partition function or its derivatives.

Each partition function is constructed to represent a particular statistical ensemble (which, in turn, corresponds to a particular free energy). The most common statistical ensembles have named partition functions. The canonical partition function applies to a canonical ensemble, in which the system is allowed to exchange heat with the environment at fixed temperature, volume, and number of particles. The grand canonical partition function applies to a grand canonical ensemble, in which the system can exchange both heat and particles with the environment, at fixed temperature, volume, and chemical potential. Other types of partition functions can be defined for different circumstances; see partition function (mathematics) for generalizations. The partition function has many physical meanings, as discussed in Meaning and significance.

Definition

As a beginning assumption, assume that a thermodynamically large system is in thermal contact with the environment, with a temperature T, and both the volume of the system and the number of constituent particles are fixed. This kind of system is called a canonical ensemble. The appropriate mathematical expression for the canonical partition function depends on the degrees of freedom of the system, whether the context is classical mechanics or quantum mechanics, and whether the spectrum of states is discrete or continuous.

Classical discrete system

For a canonical ensemble that is classical and discrete, the canonical partition function is defined as

 Z = \sum_{i} \mathrm{e}^{- \beta E_i}

where

 i is the index for the microstates of the system,
 \beta is the thermodynamic beta defined as  \tfrac{1}{k_B T} ,
 E_i is the total energy of the system in the respective microstate.
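A tiny numerical sketch of this definition, with made-up energy levels and k_B set to 1; note that the resulting Boltzmann probabilities are exactly a softmax of -\beta E_i:

import numpy as np

k_B = 1.0                      # Boltzmann constant in arbitrary units
T = 2.0                        # temperature (made-up value)
E = np.array([0.0, 1.0, 3.0])  # energies of three microstates (made up)

beta = 1.0 / (k_B * T)
Z = np.sum(np.exp(-beta * E))          # canonical partition function
p = np.exp(-beta * E) / Z              # Boltzmann probability of each microstate
print(Z, p, p.sum())                   # p sums to 1

# p is softmax(-beta * E): the softmax "temperature" plays the role of k_B T.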

 

畢竟『指數』與『對數』本是一家親的耶??!!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】十

派生碼訊

子 鼠

最高樓‧杜甫

城尖徑仄旌旆愁,獨立縹緲之飛樓。
峽坼雲霾龍虎臥,江清日抱黿鼉游。
扶桑西枝對斷石,弱水東影隨長流。
杖藜嘆世者誰子?泣血迸空回白頭。

︰老杜倡議拗體詩,誰知拗得拗不得?

今天小企鵝學堂上,吵吵鬧鬧,此時那些小企鵝們正憤愾的議論著禮拜天還『補課』之事的呢 !!

有的主張︰既然是天放假,幹嘛要補課。

有的辯證︰就算今天不補,改天補還不是一樣?

有的議論︰早補早了,有什麼好議論的!!

有的搞笑︰『始』安息日,『終』禮拜天!不就一天嗎??

那你可有說法?

哦!我唱ㄏㄡˇ你ㄗㄞ︰今天不只要補課,……

而且ㄍㄡˋ要實習,但是用的是『咸澤碼訊』

‧高級的

‧沒學過的

‧ㄍㄚˋ伊講不通的

唔是□派○生的呦!!

啞是甲意喝一杯咖啡??

[圖:各式 Tux 企鵝塗鴉]

─── 摘自《M♪o 之學習筆記本《子》開關︰【䷝】狀態編碼》

 

補課不易加課難,惟因學堂不好玩。

授業解惑苦失傳,道鎖高樓難復還。

讀書原是本家事,學問方田自耕耘。為探『交叉熵』從何起???穿梭尋索字海中,終是抵

Using the cross-entropy to classify MNIST digits

The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation. We’ll do that later in the chapter, developing an improved version of our earlier program for classifying the MNIST handwritten digits, network.py. The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter (the code is available on GitHub). For now, let’s look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we’ll use a network with 30 hidden neurons, and we’ll use a mini-batch size of 10. We set the learning rate to \eta = 0.5*

*In Chapter 1 we used the quadratic cost and a learning rate of \eta = 3.0. As discussed above, it’s not possible to say precisely what it means to use the “same” learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices.

There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra \sigma' = \sigma(1-\sigma) term in them. Suppose we average this over values for \sigma, \int_0^1 d\sigma \sigma(1-\sigma) = 1/6. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn’t be taken too seriously. Still, it can sometimes be a useful starting point.

and we train for 30 epochs. The interface to network2.py is slightly different than network.py, but it should still be clear what is going on. You can, by the way, get documentation about network2.py‘s interface by using commands such as help(network2.Network.SGD) in a Python shell.

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)

Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we’ll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.

……

What does the cross-entropy mean? Where does it come from?

Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That’s useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?

Let’s begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we’d discovered the learning slowdown described earlier, and understood that the origin was the \sigma'(z) terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it’s possible to choose a cost function so that the \sigma'(z) term disappeared. In that case, the cost C = C_x for a single training example x would satisfy

\frac{\partial C}{\partial w_j} = x_j(a-y) \ \ \ \ (71)
\frac{\partial C}{\partial b} = (a-y). \ \ \ \ (72)

If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They’d also eliminate the problem of a learning slowdown. In fact, starting from these equations we’ll now show that it’s possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have

\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z). \ \ \ \ (73)

Using \sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a) the last equation becomes

\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a). \ \ \ \ (74)

Comparing to Equation (72) we obtain

\frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}. \ \ \ \ (75)

Integrating this expression with respect to a gives

C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \ \ \ \ (76)

for some constant of integration. This is the contribution to the cost from a single training example, x. To get the full cost function we must average over training examples, obtaining

C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \ \ \ \ (77)

where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn’t something that was miraculously pulled out of thin air. Rather, it’s something that we could have discovered in a simple and natural way.
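The derivation above can be spot-checked numerically for a single sigmoid neuron: with the cross-entropy of Equation (76) (constant dropped), the analytic \partial C/\partial b = a - y of Equation (72) matches a finite difference. The weight, input, bias and target below are arbitrary:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))   # Equation (76), constant dropped

w, x, b, y = 0.8, 1.5, -0.3, 1.0       # arbitrary single-input neuron and target

def cost(b_):
    return cross_entropy(sigmoid(w * x + b_), y)

a = sigmoid(w * x + b)
analytic = a - y                        # Equation (72): no sigma'(z) factor
eps = 1e-6
numeric = (cost(b + eps) - cost(b - eps)) / (2 * eps)
print(analytic, numeric)                # the two agree to high precision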

What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function x \rightarrow y = y(x). But instead it computes the function x \rightarrow a = a(x). Suppose we think of a as our neuron’s estimated probability that y is 1, and 1-a is the estimated probability that the right value for y is 0. Then the cross-entropy measures how “surprised” we are, on average, when we learn the true value for y. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven’t said exactly what “surprise” means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don’t know of a good, short, self-contained discussion of this subject that’s available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.

──

 

欲解玄秘,尚須了『相對熵』之精義也!!!

Kullback–Leibler divergence

Definition

For discrete probability distributions P and Q, the Kullback–Leibler divergence of Q from P is defined[5] to be

D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \log\frac{P(i)}{Q(i)}.

In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. The Kullback–Leibler divergence is defined only if Q(i)=0 implies P(i)=0, for all i (absolute continuity). Whenever P(i) is zero the contribution of the i-th term is interpreted as zero because \lim_{x \to 0} x \log(x) = 0.
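A direct translation of this discrete definition into NumPy (the two distributions are made up; terms with P(i) = 0 are treated as zero, as described above):

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))   # note the asymmetry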

For distributions P and Q of a continuous random variable, the Kullback–Leibler divergence is defined to be the integral:[6]

D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(x) \, \log\frac{p(x)}{q(x)} \, {\rm d}x, \!

where p and q denote the densities of P and Q.

More generally, if P and Q are probability measures over a set X, and P is absolutely continuous with respect to Q, then the Kullback–Leibler divergence from P to Q is defined as

 D_{\mathrm{KL}}(P\|Q) = \int_X \log\frac{{\rm d}P}{{\rm d}Q} \, {\rm d}P, \!

where \frac{{\rm d}P}{{\rm d}Q} is the Radon–Nikodym derivative of P with respect to Q, and provided the expression on the right-hand side exists. Equivalently, this can be written as

 D_{\mathrm{KL}}(P\|Q) = \int_X \log\!\left(\frac{{\rm d}P}{{\rm d}Q}\right) \frac{{\rm d}P}{{\rm d}Q} \, {\rm d}Q,

which we recognize as the entropy of P relative to Q. Continuing in this case, if \mu is any measure on X for which p = \frac{{\rm d}P}{{\rm d}\mu} and q = \frac{{\rm d}Q}{{\rm d}\mu} exist (meaning that p and q are absolutely continuous with respect to \mu), then the Kullback–Leibler divergence from P to Q is given as

 D_{\mathrm{KL}}(P\|Q) = \int_X p \, \log \frac{p}{q} \, {\rm d}\mu. \!

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm.

Various conventions exist for referring to DKL(P‖Q) in words. Often it is referred to as the divergence between P and Q; however this fails to convey the fundamental asymmetry in the relation. Sometimes it may be found described as the divergence of P from, or with respect to Q (often in the context of relative entropy, or information gain). However, in the present article the divergence of Q from P will be the language used, as this best relates to the idea that it is P that is considered the underlying “true” or “best guess” distribution, that expectations will be calculated with reference to, while Q is some divergent, less good, approximate distribution.

……

Motivation

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability distribution q(x_i) = 2^{-\ell_i} over X, where \ell_i is the length of the code for x_i in bits. Therefore, the Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P.

 D_{\mathrm{KL}}(P\|Q) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) = H(P,Q) - H(P)

where H(P,Q) is the cross entropy of P and Q, and H(P) is the entropy of P.
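The decomposition D_KL(P‖Q) = H(P,Q) − H(P) is easy to confirm numerically with a pair of toy distributions (natural logarithms, so all quantities are in nats):

import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy       = -np.sum(p * np.log(p))   # H(P)
kl            =  np.sum(p * np.log(p / q))

print(kl, cross_entropy - entropy)       # the two numbers coincide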

Note also that there is a relation between the Kullback–Leibler divergence and the “rate function” in the theory of large deviations.[9][10]

[Figure: KL-Gauss-Example] Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions. Note the typical asymmetry for the Kullback–Leibler divergence is clearly visible.

……

Discrimination information

The Kullback–Leibler divergence DKL( p(x|H1) ‖ p(x|H0) ) can also be interpreted as the expected discrimination information for H1 over H0: the mean information per sample for discriminating in favor of a hypothesis H1 against a hypothesis H0, when hypothesis H1 is true.[16] Another name for this quantity, given to it by I.J. Good, is the expected weight of evidence for H1 over H0 to be expected from each sample.

The expected weight of evidence for H1 over H0 is not the same as the information gain expected per sample about the probability distribution p(H) of the hypotheses,

D_\mathrm{KL}( p(x|H_1) \| p(x|H_0) ) \neq IG = D_\mathrm{KL}( p(H|x) \| p(H|I) ).

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution f0 as possible; so that the new data produces as small an information gain DKL( f ‖ f0 ) as possible.

For example, if one had a prior distribution p(x,a) over x and a, and subsequently learnt the true distribution of a was u(a), the Kullback–Leibler divergence between the new joint distribution for x and a, q(x|a) u(a), and the earlier prior distribution would be:

D_\mathrm{KL}(q(x|a)u(a)\|p(x,a)) = \operatorname{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)\|p(x|a))\} + D_\mathrm{KL}(u(a)\|p(a)),

i.e. the sum of the Kullback–Leibler divergence of p(a) the prior distribution for a from the updated distribution u(a), plus the expected value (using the probability distribution u(a)) of the Kullback–Leibler divergence of the prior conditional distribution p(x|a) from the new conditional distribution q(x|a). (Note that often the later expected value is called the conditional Kullback–Leibler divergence (or conditional relative entropy) and denoted by DKL(q(x|a)‖p(x|a))[17]) This is minimized if q(x|a) = p(x|a) over the whole support of u(a); and we note that this result incorporates Bayes’ theorem, if the new distribution u(a) is in fact a δ function representing certainty that a has one particular value.

MDI can be seen as an extension of Laplace‘s Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the Kullback–Leibler divergence continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising the Kullback–Leibler divergence of m from p with respect to m is equivalent to minimizing the cross-entropy of p and m, since

H(p,m) = H(p) + D_{\mathrm{KL}}(p\|m),

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising DKL(p‖m) subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be DKL(p‖m), rather than H(p,m).

……

Relationship to available work

Surprisals[18] add where probabilities multiply. The surprisal for an event of probability p is defined as s=k \ln(1 / p). If k is \{ 1, 1/\ln 2, 1.38\times 10^{-23}\} then surprisal is in \{nats, bits, or J/K\} so that, for instance, there are N bits of surprisal for landing all “heads” on a toss of N coins.

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S (entropy) for a given set of control parameters (like pressure P or volume V). This constrained entropy maximization, both classically[19] and quantum mechanically,[20] minimizes Gibbs availability in entropy units[21] A\equiv -k \ln Z where Z is a constrained multiplicity or partition function.

When temperature T is fixed, free energy (T \times A) is also minimized. Thus if T, V and number of molecules N are constant, the Helmholtz free energy F\equiv U-TS (where U is energy) is minimized as a system “equilibrates.” If T and P are held constant (say during processes in your body), the Gibbs free energy G=U+PV-TS is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature T_o and pressure P_o is W = \Delta G =NkT_o \Theta(V/V_o) where V_o = NkT_o/P_o and \Theta(x)=x-1-\ln x\ge 0 (see also Gibbs inequality).

More generally[22] the work available relative to some ambient is obtained by multiplying ambient temperature T_o by Kullback–Leibler divergence or net surprisal \Delta I\ge 0, defined as the average value of k\ln(p/p_o) where p_o is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of V_o and T_o is thus W=T_o \Delta I, where Kullback–Leibler divergence \Delta I = Nk[\Theta(V/V_o)+\frac{3}{2}\Theta(T/T_o)]. The resulting contours of constant Kullback–Leibler divergence, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here.[23] Thus Kullback–Leibler divergence measures thermodynamic availability in bits.

[Figure: ArgonKLdivergence] Pressure versus volume plot of available work from a mole of Argon gas relative to ambient, calculated as T_o times the Kullback–Leibler divergence.

───

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】九

周易《繫辭上》

一陰一陽之謂道,繼之者善也,成之者性也。仁者見之謂之仁,知者見之謂之知。百姓日用而不知,故君子之道鮮矣。顯諸仁,藏諸用,鼓萬物而不與聖人同憂,盛德大業至矣哉。富有之謂大業,日新之謂盛德。生生之謂易,成象之謂乾,效法之謂坤,極數知來之謂占,通變之謂事,陰陽不測之謂神。

 

若問易經能解『夏農熵』嗎?

易經》是中國最古老的文獻之一,並被儒家尊為「五經」之首;一般說上古三大奇書包括《黃帝內經》、《易經》、《山海經》,但它們成書都較晚。《易經》以一套符號系統來描述狀態的簡易、變易、不易,表現了中國古典文化的哲學和宇宙觀。它的中心思想,是以陰陽的交替變化描述世間萬物。《易經》最初用於占卜,但它的影響遍及中國的哲學宗教醫學天文算術文學音樂藝術軍事武術等各方面。自從17世紀開始,《易經》也被介紹到西方。在四庫全書中為經部,十三經中未經秦始皇焚書之害 ,它是最早哲學書

───

 

『三畫』則八卦成列,『六爻』現六十四卦圖象 ,此『三畫六爻』儼然是『比特值』耶!所謂『陰陽不測之謂神』,實寫『真隨機』乎!!??

Randomness

Randomness is the lack of pattern or predictability in events.[1] A random sequence of events, symbols or steps has no order and does not follow an intelligible pattern or combination. Individual random events are by definition unpredictable, but in many cases the frequency of different outcomes over a large number of events (or “trials”) is predictable. For example, when throwing two dice, the outcome of any particular roll is unpredictable, but a sum of 7 will occur twice as often as 4. In this view, randomness is a measure of uncertainty of an outcome, rather than haphazardness, and applies to concepts of chance, probability, and information entropy.

The fields of mathematics, probability, and statistics use formal definitions of randomness. In statistics, a random variable is an assignment of a numerical value to each possible outcome of an event space. This association facilitates the identification and the calculation of probabilities of the events. Random variables can appear in random sequences. A random process is a sequence of random variables whose outcomes do not follow a deterministic pattern, but follow an evolution described by probability distributions. These and other constructs are extremely useful in probability theory and the various applications of randomness.

Randomness is most often used in statistics to signify well-defined statistical properties. Monte Carlo methods, which rely on random input (such as from random number generators or pseudorandom number generators), are important techniques in science, as, for instance, in computational science.[2] By analogy, quasi-Monte Carlo methods use quasirandom number generators.

[Figure: RandomBitmap] A pseudorandomly generated bitmap.

 

今人尚且難以捉摸 \pi 是否為『正規數』

Normal number

In mathematics, a normal number is a real number whose infinite sequence of digits in every base b[1] is distributed uniformly in the sense that each of the b digit values has the same natural density 1/b, also all possible b^2 pairs of digits are equally likely with density b^{-2}, all b^3 triplets of digits equally likely with density b^{-3}, etc.

Intuitively this means that no digit, or (finite) combination of digits, occurs more frequently than any other, and this is true whether the number is written in base 10, binary, or any other base. A normal number can be thought of as an infinite sequence of coin flips (binary) or rolls of a die (base 6). Even though there will be sequences such as 10, 100, or more consecutive tails (binary) or fives (base 6) or even 10, 100, or more repetitions of a sequence such as tail-head (two consecutive coin flips) or 6-1 (two consecutive rolls of a die), there will also be equally many of any other sequence of equal length. No digit or sequence is “favored”.

While a general proof can be given that almost all real numbers are normal (in the sense that the set of exceptions has Lebesgue measure zero), this proof is not constructive and only very few specific numbers have been shown to be normal. For example, Chaitin’s constant is normal. It is widely believed that the numbers √2, π, and e are normal, but a proof remains elusive.

───

 

\pi 之訝異公式已現身︰

貝利-波爾溫-普勞夫公式

貝利-波爾溫-普勞夫公式(BBP公式)提供了一個計算圓周率 π 的第 n 位二進位數字的 spigot 算法(spigot algorithm)。這個求和公式是在1995年由西蒙·普勞夫提出的,並以公布這個公式的論文作者大衛·貝利(David H. Bailey)、皮特·波爾溫(Peter Borwein)和普勞夫的名字命名。在論文發表之前,普勞夫已將此公式在他的網站上公布[1][2]。這個公式是:

 \pi = \sum_{k = 0}^{\infty}\left[ \frac{1}{16^k} \left( \frac{4}{8k + 1} - \frac{2}{8k + 4} - \frac{1}{8k + 5} - \frac{1}{8k + 6} \right) \right]

這個公式的發現曾震驚學界。數百年來,求出π的第n位小數而不求出它的前n-1位曾被認為是不可能的。
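下面是一段極簡的示意程式,直接對 BBP 級數取部份和來逼近 \pi(用 Fraction 保持精確有理數);至於『不必求出前面位數』的十六進位位數抽取,還需要模冪運算技巧,此處並未示範:

from fractions import Fraction

def bbp_partial_sum(n_terms):
    """Partial sum of the BBP series for pi, using exact rationals."""
    s = Fraction(0)
    for k in range(n_terms):
        s += Fraction(1, 16**k) * (Fraction(4, 8*k + 1) - Fraction(2, 8*k + 4)
                                   - Fraction(1, 8*k + 5) - Fraction(1, 8*k + 6))
    return s

for n in (1, 2, 5, 10):
    print(n, float(bbp_partial_sum(n)))   # 很快收斂到 3.141592653589793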

自從這個發現以來,發現了更多的無理數常數的類似公式,它們都有一個類似的形式:

\alpha = \sum_{k = 0}^{\infty}\left[ \frac{1}{b^k} \frac{p(k)}{q(k)} \right]

其中 α 是目標常數,p 和 q 是整係數多項式,b ≥ 2 是整數的數制。

這種形式的公式被稱為BBP式公式(BBP-type formulas)[3]。由特定的 p、q 和 b 可組合出一些著名的常數。但至今尚未找出一種系統的算法來尋找合適的組合,而已知的公式多是通過實驗數學得出的。

───

 

此事是否能當真 ─── 十六進位制之 \pi !!??

『The Quest for Pi』給證明︰

[圖:《The Quest for Pi》一文及其中 BBP 公式之證明截圖]

 

※ 註

\int \limits_0^1 \frac{4y}{y^2 - 2} dy = \int \limits_0^1 \frac{-4y}{2 - y^2} dy = 2 \ln (2 - y^2) \Big |_0^1

\int \limits_0^1 \frac{4y - 8}{y^2 - 2y + 2} dy = \int \limits_0^1 \frac{4y - 4}{y^2 - 2y + 2} dy - \int \limits_0^1 \frac{4}{{(y-1)}^2 + 1} dy = 2 \ln (y^2-2y + 2) \Big |_0^1 - 4 \arctan(y - 1) \Big |_0^1

\therefore \int \limits_0^1 \frac{4y}{y^2 - 2} dy - \int \limits_0^1 \frac{4y - 8}{y^2 - 2y + 2} dy = \pi
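上列兩個積分之差恰為 \pi,可用簡單的數值積分粗略驗證(以下僅為示意,取樣點數是隨意選的):

import numpy as np

y = np.linspace(0.0, 1.0, 1_000_001)
f1 = 4*y / (y**2 - 2)              # 第一個被積函數
f2 = (4*y - 8) / (y**2 - 2*y + 2)  # 第二個被積函數
print(np.trapz(f1 - f2, y))        # 約 3.14159265...,即 pi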

───

 

如是『亂數』是什麼乙事豈可不深思熟慮也???

創世紀』第十三章『亞伯蘭』以起先『築壇的地方』分別了『左‧右』,讓『羅得』來『選擇』。這就是今天稱之為『一分‧一擇』 I cut,  you choose 的『公平分享』規範。舉例說,一人以他所認為的『公平』切蛋糕,讓另一人先作『選擇』。這也就是 Bruno de Finetti 所講的︰由此方來設定輸贏『前提』之『賠率』和『賭注』,讓彼方決定購買『前提』之『正反方』一樣。

那麼 Bruno de Finetti 所堅持的『主觀機率論』是什麼呢?他認為『機率』就是一個人對某『事件』發生之『相信度』評估,這是由那個人的『知識』、『經驗』以及『資訊』等等來決定。比方講,假使問多個人『印象派大師莫內的生日是十一月十五號的機率是多少?』。不知道『莫內是誰』的,可能認為是 \frac{1}{365};過去聽說

[圖:克勞德‧莫內《印象‧日出》,1872 年]

自己跟『莫內同星座』的也許以為 \frac{1}{30};還有一個上網『谷歌』 Google 的說『機率是零』。所謂的『客觀機率』真的是存在的嗎?因此 Bruno de Finetti 的論點,自有不可忽視的重要性,更由於『量子力學』的『量測理論』將『觀察者』放進了『不確定性』框架中,這個『主‧客觀』的爭論,目前勢將持續進行下去的吧!在此僅用『亂數產生器』的『擬似』 Pseudo 與『真實』 Real 之說來看,人們真的有『判準』來『區分』這兩者的嗎?比方講,現今所相信的『真實』之『亂數產生器』來自於那些『隨機性』的『物理現象』;常用之『擬似』的『亂數產生器』可以從某種『計算式X_{n+1} = (a X_n + b) \ \textrm{mod} \ m 裡得到。雖然說『已知』之『演算法』在『夠長的產生序列』後,難免於『重複再現』,要是真存在一個『演算法』,它的『再現所需時間』是『千百億年』的呢?那我們能『發現』它是有『公式』的嗎??

[圖:1999 年日全食]

過去『存在主義者』曾經議論說︰如果講『上帝』與『魔鬼』都具有『超越人』之『大能』,當下聽聞『敲門聲』,身為一個『』,你又怎麽能夠知道『敲門者』是『上帝』還是『魔鬼』的呢!!

─── 摘自《物理哲學·下中+‧

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】八

晚遊六橋待月記》 明‧袁宏道

西湖最盛,為春為月。一日之盛,為朝煙,為夕嵐。今歲春雪甚盛 ,梅花為寒所勒,與杏桃相次開發,尤為奇觀。石篑數為余言:「傅金吾園中梅,張功甫玉照堂故物也,急往觀之。」余時為桃花所戀,竟不忍去湖上。

由斷橋至蘇隄一帶,綠煙紅霧,瀰漫二十餘里。歌吹為風,粉汗為雨,羅紈之盛,多於隄畔之草。豔冶極矣!

然杭人遊湖,止午、未、申三時。其實湖光染翠之工,山嵐設色之妙,皆在朝日始出,夕舂未下,始極其濃媚。月景尤不可言,花態柳情,山容水意,別是一種趣味。此樂留與山僧遊客受用,安可為俗士道哉!

 

不論各朝各代書面文字是什麼風貌,與當時口語到底能夠差異多大呢?為何一般賞析解說反倒長篇大論耶!若問用今文改寫表現一事之難易,為之者自知也。或許代代想法念頭滋養的土壤不同,起心動念的文化時尚有異,於是乎雖然都是『天地世間』的『符號文章 』,『轉譯‧再現』果然困難乎??怕是『習以為常』的思維定勢使之然的吧!!

人們習以為常的語言、文字都是符號系統。當我們講到玫瑰花,是指「可以看、聞、摸的那種植物的花」,如果缺乏感官經驗,也許根本不能知道玫瑰花是什麼?或者正因為經驗的自然平常,以致於我們忘了玫瑰花只是個符號。 有人說玫瑰花即使換個名字‥ Rose ,依然芬芳香甜,指的就是這個道理。如同月亮高掛天空,只能一指說月;天下廣大,當然也只能一指而知,所以說萬物雖然眾多,可以用像談馬一樣的東西去理 解。在歷史上大數學家邱奇的 λ 演算 ,把數學的形式系統推上了高峰,同時加深了人們對『數是什麼? 』的認識。

一朵花、一隻鳥、一座山、一片林…都是一,知道『』、又知道『加上一』,就可以知道數的無窮無盡。然而對於無窮無盡的數又該怎樣命名呢?古代中國發明了十倍為單位的記數法︰ 十十為百、十百為千、十千為萬…。初期用一、二、三、四、五、六、七、八、九、十來書寫,而後因為需要發展了大寫數字‥壹、貳、參、肆、伍、陸、柒、捌、 玖、拾。至於說為什麼用十呢?也許因為人有十個手指頭,常用來數數指物。那為什麼沒有零呢?中國古代並沒有零的符號,在概念上『九章算術』用「無入」來表 達,算盤上用「空位」去說明。現在所使用的阿拉伯記數法︰0、1、2、3、4、5、6、7、8、9,是在漫漫歷史長河中逐步變遷而來。由上述可知三百、參 佰、300 雖然說的是同一個數,它的符號卻是不同的。同樣可以知道阿拉伯記數法用位置代表數量級,所以 0 的加入是必要的。

─── 摘自《天下一指、萬物一馬︰二進制

 

那麼可否用著『此時』之符號、文字、經驗、…… 了解『彼時』之文章、人物、風情的呢??宛如置身當時,親見同受一般樣耶!!夏農之『信源編碼定理』開啟了一扇窗︰

Shannon’s source coding theorem

In information theory, Shannon’s source coding theorem (or noiseless coding theorem) establishes the limits to possible data compression, and the operational meaning of the Shannon entropy.

The source coding theorem shows that (in the limit, as the length of a stream of independent and identically-distributed random variable (i.i.d.) data tends to infinity) it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without it being virtually certain that information will be lost. However it is possible to get the code rate arbitrarily close to the Shannon entropy, with negligible probability of loss.

The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a random variable) and of the size of the target alphabet.

Proof: Source coding theorem for symbol codes

For 1 ≤ i ≤ n let s_i denote the word length of each possible x_i. Define q_i = a^{-s_i}/C, where C is chosen so that q_1 + … + q_n = 1. Then

H(X) = -\sum_{i=1}^n p_i \log_a p_i
\leq -\sum_{i=1}^n p_i \log_a q_i
= -\sum_{i=1}^n p_i \log_a a^{-s_i} + \sum_{i=1}^n p_i \log_a C
= \sum_{i=1}^n p_i s_i + \log_a C
\leq \sum_{i=1}^n p_i s_i = \mathbb{E} S

where the second line follows from Gibbs’ inequality and the fifth line follows from Kraft’s inequality:

C = \sum_{i=1}^n a^{-s_i} \leq 1

so log C ≤ 0.

For the second inequality we may set

s_i = \lceil - \log_a p_i \rceil

so that

 - \log_a p_i \leq s_i < -\log_a p_i + 1

and so

 a^{-s_i} \leq p_i

and

 \sum a^{-s_i} \leq \sum p_i = 1

and so by Kraft’s inequality there exists a prefix-free code having those word lengths. Thus the minimal S satisfies

\mathbb{E} S = \sum_{i=1}^n p_i s_i < \sum_{i=1}^n p_i \left( -\log_a p_i + 1 \right) = -\sum_{i=1}^n p_i \log_a p_i + 1 = H_a(X) + 1,

where H_a(X) denotes the entropy of the source measured in base-a units.
───
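就上述定理可寫個小程式感受一下:取 Shannon 碼長 s_i = \lceil -\log_2 p_i \rceil,期望碼長必落在 H(X) 與 H(X)+1 之間,且滿足 Kraft 不等式(以下分佈是隨意編的,僅作示意):

import math

p = [0.5, 0.25, 0.125, 0.125]                      # 隨意假設的信源分佈

entropy = -sum(pi * math.log2(pi) for pi in p)     # H(X),單位為 bit
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # Shannon 碼長 s_i
expected_length = sum(pi * si for pi, si in zip(p, lengths))

kraft = sum(2**(-si) for si in lengths)            # Kraft 和必須 <= 1
print(entropy, expected_length, kraft)
# H(X) <= E[S] < H(X) + 1,且 Kraft 不等式保證存在對應的字首碼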

 

依據此理論『完整資訊』不管怎麼『無漏失編碼』終究有極限也 !!??因此即使以

熵編碼法

熵編碼法是一種獨立於介質的具體特徵的進行無失真資料壓縮的方案。

一種主要類型的熵編碼建立並分配給輸入中的每個唯一的符號一個唯一的字首碼。這些編碼器然後通過用相應的可變長度字首無關(prefix-free)輸出碼字取代每個固定長度的輸入符號壓縮資料。每個碼字的長度近似與機率的負對數成比例。因此,最常見的符號使用最短的碼。

根據香農信源編碼定理,一個符號的最佳碼長是 -\log_b P,其中 b 是用來輸出的碼的數目,P 是輸入符號出現的機率。

霍夫曼編碼算術編碼是兩種最常見的熵編碼技術。如果預先已知資料流的近似熵特性(尤其是對於訊號壓縮),可以使用簡單的靜態碼。這些靜態碼,包括通用密碼(如Elias gamma coding或斐波那契編碼)和哥倫布編碼(比如元編碼Rice編碼)。

一般熵編碼器與其它編碼器聯合使用。比如LHA首先使用LZ編碼,然後將其結果進行熵編碼。ZipBzip的最後一級編碼也是熵編碼。

───

 

來編碼,依然『按例』可『解碼』矣??!!如是『候風地動儀』果也可重現也︰

大科學家張衡是東漢士大夫、天文學家、地理學家、數學家、發明家、文學家,南陽西鄂人,製作以水力推動的『渾天儀』,發明能探測地震方位的『候風地動儀』以及『指南車』。他發現了月蝕的真正原因,也曾繪製兩千五百顆星辰的星圖。稱名漢賦四大家之一,文學上創作了〈二京賦〉、〈歸田賦〉等等辭賦名篇。

[圖:紫金山天文台的渾天儀]

[圖:指南車模型(科學博物館藏)]

[圖:東漢候風地動儀復原模型]

渾天儀』就是現在的天球儀,用來演示天體運動的規律。假使其中加有『窺管』一般叫做『渾儀』,可以協助觀測天文。歷史上記載戰國時期的石申、甘德最早製作『渾象』,即是渾天儀。

傳說黃帝軒轅製造『指南車』大敗炎帝於『阪泉之戰』。指南車又叫做『司南』是一種指示方向的工具。指南車並不是『羅盤』那一類使用『磁極』的指向器物。指南車的結構使用了『差動齒輪』裝置,也叫做『差速器』。物理原理是指南車直走時,左右兩輪轉動角速度相等,差動裝置不傳動『司南者』,彎行時兩側車輪的角速度就不相等,差動機制驅動著司南者『逆其差』,因此『司南者』恆『司南』。其後不知何故,指南車的製造方法就失傳了。一九二四年英國學者穆爾 Moule 發表了研究指南車的論文並根據《宋史》文獻記載給出了具體的『復原方案』。一九三七年王振鐸發表了《指南車記里鼓車之考證及模製》一文,其中改良了穆爾的設計,並且成功的製作出指南車模型。一九七一年他根據史書記載,又成功的複製了馬鈞的『黃帝指南車』。大千世界中,忽而得之,忽而失之,得得失失,何干之於大千??

張衡的『候風地動儀』之後怎麼樣了?它又失傳了一千八百年。更神奇的是有人根據南朝劉宋范曄所著《後漢書‧卷五十九‧張衡列傳第四十九》裡的『一百九十六』個字,讓它重現天日。

陽嘉元年,夏造候風地動儀。以精銅鑄成,員徑八尺,合蓋隆起,形似酒尊,飾以篆文山龜鳥獸之形。中有都柱,傍行八道,施關發機。外有八龍,首銜銅丸,下有蟾蜍,張口承之。其牙機巧制,皆隱在尊中,覆蓋周密無際。如有地動,尊則振龍機發吐丸,而蟾蜍銜之。振聲激揚,伺者因此覺知。雖一龍發機,而七首不動,尋其方面,乃知震之所在。驗之以事,合契若神。自書典所記,未之有也。嘗一龍機發而地不覺動,京師學者咸怪其無征,後數日驛至,果地震隴西,於是皆服其妙。自此以後,乃令史官記地動所從方起。

作者想張衡集文采與科技於一身,自然非比尋常,然而候風地動儀又非簡單玩意兒,三百年後的范曄竟能只用不到一百九十六個字,就將它描述清楚,這又是何等文字功夫,再一千五百年後,還有人能解讀范曄的這一百九十六個字,使張衡的候風地動儀再現,當真是奇也妙哉!果不能知這一百九十六個字的克勞德‧艾爾伍德‧夏農 Claude Elwood Shannon 之熵 Entropy 到底是有多少『比特』bit 值!!

既然叫做『候風地動儀』,它的命名必然有些來歷。西漢末年隨著社會的衝突加劇,『讖緯之學』開始廣泛大流行。《後漢書‧光武帝紀》光武帝於中元元年宣布『圖讖』於天下,把圖讖國教化。生於之後的張衡自當深知讖緯之術。就像『物候曆』的傳統,比如說《禮記‧月令》更是其來有自。然後發展成用『占候』來『預測』人事的『吉凶禍福』。因此命名裡那個『』字應是指『徵候』,藉著此徵候來『預測』之義。而『』字當是『風角』之術的觀『八方』風的用法,藉以表達『八個方位』的意思。如此看來這個候風地動儀的名義就是『測知八方地動之器』。在此引用《逸周書‧時訓解》以及《史記‧天官書》有關『風角』的一小段,以饗有興趣的讀者。

逸周書‧時訓解

立 春之日,東風解凍,又五日,蟄蟲始振,又五日,魚上冰。風不解凍,號令不行,蟄蟲不振,陰氣奸陽,魚不上冰,甲冑私藏。驚蟄之日,獺祭魚,又五日,鴻鴈 來,又五日,草木萌動。獺不祭魚,國多盜賊,鴻鴈不來,遠人不服,草木不萌動,果疏不熟。雨水之日,桃始華,又五日,倉庚鳴,又五日,鷹化為鳩。桃不始 華,是謂陽否,倉庚不鳴,臣不從主,鷹不化鳩,寇戎數起。春分之日,玄鳥至,又五日,雷乃發聲,又五日,始電。玄鳥不至,婦人不娠,雷不發聲,諸侯失民, 不始電,君無威震。穀雨之日,桐始華,又五日,田鼠化為鴽,又五日,虹始見。桐不華,歲有大寒,田鼠不化鴽,國多貪殘,虹不見,婦人苞亂。清明之日,萍始 生,又五日,鳴鳩拂其羽,又五日,戴勝降于桑。萍不生,陰氣憤盈,鳴鳩不拂其羽,國不治兵,戴勝不降于桑,政教不中。立夏之日,螻蟈鳴,又五日,蚯蚓出, 又五日,王瓜生。螻蟈不鳴,水潦淫漫,蚯蚓不出,嬖奪后命,王瓜不生,困於百姓。小滿之日,苦菜秀,又五日,靡草死,又五日,小暑至。苦菜不秀,賢人潛 伏,靡草不死,國縱盜賊,小暑不至,是謂陰慝。芒種之日,螳螂生,又五日,鶪始鳴,又五日,反舌無聲。螳螂不生,是謂陰息,鶪不始鳴,令姦雍偪,反舌有 聲,佞人在側。夏至之日,鹿角解,又五日,蜩始鳴,又五日,半夏生。鹿角不解,兵革不息,蜩不鳴,貴臣放逸,半夏不生,民多厲疾。小暑之日,溫風至,又五 日,螅蟀居辟,又五日,鷹乃學習。溫風不生,國無寬教,螅蟀不居辟,恆急之暴,鷹不學習,不備戎盜。大暑之日,腐草為蠲,又五日,土潤溽暑,又五日,大雨 時行。腐草不為蠲,穀實鮮落,土潤不溽暑,物不應罰,大雨不時行,國無恩澤。立秋之日,涼風至,又五日,白露降,又五日,寒蟬鳴。涼風不至,國無嚴政,白 露不降,民多欬病,寒蟬不鳴,人皆力爭。處暑之日,鷹乃祭鳥,又五日,天地始肅,又五日,禾乃登。鷹不祭鳥,師旅無功,天地不肅,君臣乃□,農不登穀,暖 氣為凶。白露之日,鴻鴈來,又五日,玄鳥歸,又五日,群鳥養羞。鴻鴈不來,遠人背畔,玄鳥不歸,室家離散,群鳥不養羞,下臣驕慢。秋分之日,雷始收聲,又 五日,蟄蟲培戶,又五日,水始涸。雷不始收聲,諸侯淫汏,蟄蟲不培戶,民靡有賴,水不始涸,甲蟲為害。寒露之日,鴻鴈來賓,又五日,爵入大水為蛤,又五 日,菊有黃華。鴻鴈不來,小民不服,爵不入大水,失時之極,菊無黃華,土不稼穡。霜降之日,豺乃祭獸,又五日,草木黃落,又五日,蟄蟲咸俯。豺不祭獸,爪 牙不良,草木不黃落,是為愆陽,蟄蟲不咸俯,民多流亡。立冬之日,水始冰,又五日,地始凍,又五日,雉入大水為蜃。水不冰,是為陰負,地不始凍,咎徵之 咎,雉不入大水,國多淫婦。小雪之日,虹藏不見,又五日,天氣上騰,地氣下降,又五日,閉塞而成冬。虹不藏,婦不專一,天氣不上騰,地氣不下降,君臣相 嫉,不閉塞而成冬,母后淫佚。大雪之日,鶡旦不鳴,又五日,虎始交,又五日,荔挺生。鶡旦猶鳴,國有訛言,虎不始交,將帥不和,荔挺不生,卿士專權。冬至 之日,蚯蚓結,又五日,麋角解,又五日,水泉動。蚯蚓不結,君政不行,麋角不解,兵甲不藏,水泉不動,陰不承陽。小寒之日,鴈北向,又五日,鵲始巢,又五 日,雉始雊。鴈不北向,民不懷主,鵲不始巢,國不寧,雉不始雊,國大水。大寒之日,雞始乳,又五日,鷙鳥厲疾,又五日,水澤腹堅。雞不始乳,淫女亂男,鷙 鳥不厲,國不除姦,水澤不腹堅,言乃不從。

 

 

史記‧天官書
而漢魏鮮集臘明正月旦決八風。風從南方來,大旱;西南,小旱;西方,有兵;西北,戎菽為,小雨,趣兵;北方,為中歲;東北,為上歲;東方,大水;東南,民 有疾疫,歲惡。故八風各與其沖對,課多者為勝。多勝少,久勝亟,疾勝徐。旦至食,為麥;食至日昳,為稷;昳至餔,為黍;餔至下餔,為菽;下餔至日入,為 麻。欲終日(有雨)有雲,有風,有日。日當其時者,深而多實;無雲有風日,當其時,淺而多實;有雲風,無日,當其時,深而少實;有日,無雲,不風,當其時 者稼有敗。如食頃,小敗;熟五斗米頃,大敗。則風復起,有雲,其稼復起。各以其時用雲色占種(其)所宜。其雨雪若寒,歲惡。

易經《說卦傳》上《第十一章》講『震為雷為龍』,『』又是『』意思,所謂『帝出乎震』是說『春雷』振奮大地,也許正是張衡在候風地動儀上用『龍吐丸』的原因。據聞《南瞻部洲志》載

日 有踆烏,月有蟾蜍。羿請不死之藥於西王母,嫦娥竊之以奔月。蟾蜍本乃嫦娥之茶寵,食余茶而化為仙獸,亦得仙奔月,是為月魄。初,月魄为三足,然其日食靈 芝,夜食月桂,歷三千年而修成四足。后有吳剛者,為帝懲治而伐桂,斧起,樹創而瞬時癒之,歷八十一天始落一枝,月魄不勝其煩,遂銜桂枝而下界,有緣者可得 其侍奉,謂之折桂也。

不知是否是因為『震位東方』而『兌在西方』為『澤』,所以才用『月魄蟾蜍』銜之。加之以篆文與鳥獸之形,儼然是個『神器』。

都

機

關

俗話說︰時過境遷。這個『時境原則』在閱讀『歷史』的文獻時非常重要。人們很容易不自覺的把『字詞』的『此時』意義強加於『彼時』,以至於發生了『誤讀』現象。

范曄文中之『中有都柱,傍行八道,施關發機』是講候風地動儀的內部構造,故為『要點』。『都』字的造字本義是『有關卡城門把守的大城市』。『關』字是指『將門閂插進左右兩栓孔,緊閉大門』。『機』字是『事物發生的樞紐』。

在此我們將范曄之文,與『測知地震』有關的整理如下︰

一、『精銅鑄成,員徑八尺』,漢代一尺約現今 23.09 公分,可知直徑八尺的『酒尊』,真是『又大又重』。

二、『外有八龍,首銜銅丸,下有蟾蜍,張口承之』,平日沒地震時『銅丸』銜於『龍首』。

三、『其牙機巧制,皆隱在尊中,覆蓋周密無際。』,『密封的』很好,『牙巧』之『機制』隱於其中。

四、『如有地動,尊則振龍機發吐丸,而蟾蜍銜之。』,遇到地震時,『尊則振』同時『龍機發吐丸』,而且『蟾蜍銜之』。也就是說,地震會讓『』振動,且觸發了『牙機巧制』。

── 摘自《【Sonic π】聲波之傳播原理︰拾遺篇《一》候風地動儀 ‧上

 

 

 

 

 

 

 

 

 

 

 

 

 

W!o+ 的《小伶鼬工坊演義》︰神經網絡【學而堯曰】七

『統計力學』與『資訊理論』終究不得不相會於『隨機性』以及『不確定性』之『機率』概念上!『最大熵』原理直衝牛斗︰

Principle of maximum entropy

The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy.

Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data. Of those, the one with maximal information entropy is the proper distribution, according to this principle.

History

The principle was first expounded by E. T. Jaynes in two papers in 1957[1][2] where he emphasized a natural correspondence between statistical mechanics and information theory. In particular, Jaynes offered a new and very general rationale why the Gibbsian method of statistical mechanics works. He argued that the entropy of statistical mechanics and the information entropy of information theory are principally the same thing. Consequently, statistical mechanics should be seen just as a particular application of a general tool of logical inference and information theory.

……

The Wallis derivation

The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962.[8] It is essentially the same mathematical argument used for the Maxwell–Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of ‘uncertainty’, ‘uninformativeness’, or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.

Suppose an individual wishes to make a probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. (For this step to be successful, the information must be a constraint given by an open set in the space of probability measures). If it is inconsistent, she will reject it and try again. If it is consistent, her assessment will be

p_i = \frac{n_i}{N}

where pi is the probability of the ith proposition, while ni is the number of quanta that were assigned to the ith proposition (i.e. the number of balls that ended up in bucket i).

Now, in order to reduce the ‘graininess’ of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,

Pr(\mathbf{p}) = W \cdot m^{-N}

where

W = \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!}

is sometimes known as the multiplicity of the outcome.

The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, the protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize

\begin{array}{rcl} \frac{1}{N}\log W &=& \frac{1}{N}\log \frac{N!}{n_1! \, n_2! \, \dotsb \, n_m!} \\ \\ &=& \frac{1}{N}\log \frac{N!}{(Np_1)! \, (Np_2)! \, \dotsb \, (Np_m)!} \\ \\ &=& \frac{1}{N}\left( \log N! - \sum_{i=1}^m \log ((Np_i)!) \right). \end{array}

At this point, in order to simplify the expression, the protagonist takes the limit as N\to\infty, i.e. as the probability levels go from grainy discrete values to smooth continuous values. Using Stirling’s approximation, she finds

\begin{array}{rcl} \lim_{N \to \infty}\left(\frac{1}{N}\log W\right) &=& \frac{1}{N}\left( N\log N - \sum_{i=1}^m Np_i\log (Np_i) \right) \\ \\ &=& \log N - \sum_{i=1}^m p_i\log (Np_i) \\ \\ &=& \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \\ \\ &=& \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \\ \\ &=& - \sum_{i=1}^m p_i\log p_i \\ \\ &=& H(\mathbf{p}). \end{array}

All that remains for the protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all “fair” random distributions, in the limit as the probability levels go from discrete to continuous.
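The combinatorial argument can also be watched numerically: for a fixed assignment p, the quantity (1/N) log W does approach H(p) as N grows. The distribution below is arbitrary, and log W is computed via log-gamma to avoid huge factorials:

import numpy as np
from math import lgamma

def log_multiplicity(counts):
    """log W = log( N! / (n_1! n_2! ... n_m!) ), computed via log-gamma."""
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)

p = np.array([0.5, 0.3, 0.2])          # an arbitrary probability assignment
H = -np.sum(p * np.log(p))             # its entropy H(p), in nats

for N in (10, 100, 10_000, 1_000_000):
    counts = (N * p).round().astype(int)   # n_i = N p_i quanta in each bucket
    print(N, log_multiplicity(counts) / N, H)
# (1/N) log W climbs toward H(p) as N grows, as the Stirling argument predicts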

───

 

這位傑尼斯

Edwin Thompson Jaynes

Edwin Thompson Jaynes (July 5, 1922 – April 30,[1] 1998) was the Wayman Crow Distinguished Professor of Physics at Washington University in St. Louis. He wrote extensively on statistical mechanics and on foundations of probability and statistical inference, initiating in 1957 the MaxEnt interpretation of thermodynamics,[2][3] as being a particular application of more general Bayesian/information theory techniques (although he argued this was already implicit in the works of Gibbs). Jaynes strongly promoted the interpretation of probability theory as an extension of logic.

───

 

先生將波利亞之『似合理的』Plausible 『推理』系統化,

[圖:喬治‧波利亞 George Pólya,約 1973 年]

How to Solve It

suggests the following steps when solving a mathematical problem:

1. First, you have to understand the problem.
2. After understanding, then make a plan.
3. Carry out the plan.
4. Look back on your work. How could it be better?

If this technique fails, Pólya advises: “If you can’t solve a problem, then there is an easier problem you can solve: find it.” Or: “If you cannot solve the proposed problem, try to solve first some related problem. Could you imagine a more accessible related problem?”

喬治‧波利亞長期從事數學教學,對數學思維的一般規律有深入的研究,一生推動數學教育。一九五四年,波利亞寫了兩卷不同於一般的數學書《Induction and Analogy in Mathematics》與《Patterns of Plausible Inference》,探討『啟發式』之『思維樣態』,這常常是一種《數學發現》之切入點,也是探尋『常識徵候』中的『合理性』根源。舉個例子來說,典型的亞里斯多德式的『三段論』 syllogisms ︰

P \Longrightarrow Q
P 真, \therefore Q 真。

如果對比著『似合理的』Plausible 『推理』︰

P \Longrightarrow Q
Q 真, P 更可能是真。

這種『推理』一般稱之為『肯定後件Q 的『邏輯誤謬』。因為在『邏輯』上,這種『形式』的推導,並不『必然的』保障『歸結』一定是『』的。然而這種『推理形式』是完全沒有『道理』的嗎?如果從『三段論』之『邏輯』上來講,要是 Q 為『』,P 也就『必然的』為『』。所以假使 P 為『』之『必要條件Q 為『』,那麼 P 不該是『更可能』是『』的嗎??

─── 摘自《物理哲學·下中……

 

把『機率論』帶入邏輯殿堂,

Probability Theory: The Logic Of Science

The material available from this page is a pdf version of E.T. Jaynes’s book.

Introduction

Please note that the contents of the file from the link below are slightly out of sync with the actual contents of the book. The listing on this page corresponds to the existing chapter order and names.

……

[圖:E.T. Jaynes《Probability Theory: The Logic of Science》內頁截圖]

───

 

誠非偶然的耶!!??若是人人都能如是『運作理則』,庶幾可免『賭徒謬誤』矣??!!

Gambler’s fallacy

The gambler’s fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future, or that, if something happens less frequently than normal during some period, it will happen more frequently in the future (presumably as a means of balancing nature). In situations where what is being observed is truly random (i.e., independent trials of a random process), this belief, though appealing to the human mind, is false. This fallacy can arise in many practical situations although it is most strongly associated with gambling where such mistakes are common among players.

The use of the term Monte Carlo fallacy originates from the most famous example of this phenomenon, which occurred in a Monte Carlo Casino in 1913.[1][better source needed][2]

……

Examples

Coin toss

The gambler’s fallacy can be illustrated by considering the repeated toss of a fair coin. With a fair coin, the outcomes in different tosses are statistically independent and the probability of getting heads on a single toss is exactly 1/2 (one in two). It follows that the probability of getting two heads in two tosses is 1/4 (one in four) and the probability of getting three heads in three tosses is 1/8 (one in eight). In general, if we let Ai be the event that toss i of a fair coin comes up heads, then we have,

\Pr\left(\bigcap_{i=1}^n A_i\right)=\prod_{i=1}^n \Pr(A_i)={1\over2^n}.

Now suppose that we have just tossed four heads in a row, so that if the next coin toss were also to come up heads, it would complete a run of five successive heads. Since the probability of a run of five successive heads is only 1/32 (one in thirty-two), a person subject to the gambler’s fallacy might believe that this next flip was less likely to be heads than to be tails. However, this is not correct, and is a manifestation of the gambler’s fallacy; the event of 5 heads in a row and the event of “first 4 heads, then a tails” are equally likely, each having probability 1/32. Given that the first four tosses turn up heads, the probability that the next toss is a head is in fact,

\Pr\left(A_5|A_1 \cap A_2 \cap A_3 \cap A_4 \right)=\Pr\left(A_5\right)=\frac{1}{2}.

While a run of five heads is only 1/32 = 0.03125, it is only that before the coin is first tossed. After the first four tosses the results are no longer unknown, so their probabilities are 1. Reasoning that it is more likely that the next toss will be a tail than a head due to the past tosses, that a run of luck in the past somehow influences the odds in the future, is the fallacy.
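A quick Monte Carlo check of that conditional probability (the sample size and seed are arbitrary):

import random

random.seed(0)
runs_of_four = fifth_is_head = 0

for _ in range(1_000_000):
    flips = [random.random() < 0.5 for _ in range(5)]   # True = heads
    if all(flips[:4]):                                  # first four tosses were heads
        runs_of_four += 1
        fifth_is_head += flips[4]

print(fifth_is_head / runs_of_four)   # about 0.5: the past run does not change the odds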

 

[Figure: Lawoflargenumbersanimation2] Simulation of coin tosses: Each frame, a coin is flipped which is red on one side and blue on the other. The result of each flip is added as a colored dot in the corresponding column. As the pie chart shows, the proportion of red versus blue approaches 50-50 (the law of large numbers). But the difference between red and blue does not systematically decrease to zero.

───