W!o+'s 《小伶鼬工坊演義》: Neural Networks 【學而堯曰】 Part Ten

派生碼訊

Zi, the Rat

The Highest Tower · Du Fu

Sharp city walls, steep narrow paths, the banners heavy with sorrow; alone stands a soaring tower amid the faint, far mists.
Where the gorge splits, clouds and haze lie like crouching dragons and tigers; in the clear river, embraced by the sun, great turtles and alligators swim.
Fusang's western branch faces the broken rock; the Weak Water's eastern shadow follows the long flow.
Who is he, leaning on a staff and lamenting the age? Weeping blood into the empty sky, he turns back a white-haired head.

︰Old Du championed the deliberately "twisted-tone" style of verse; who can say whether such twisting can be pulled off or not?

Today the little-penguin schoolhouse is in an uproar; the little penguins are indignantly debating the matter of still having a "make-up class" on Sunday!!

Some argue: since it is a day off, why hold a make-up class at all?

Some counter: even if we skip it today, making it up another day comes to the same thing, does it not?

Some reason: the sooner it is made up, the sooner it is done; what is there to argue about!!

Some joke: it "begins" as the Sabbath and "ends" as Sunday; is it not just one day either way??

And you, what do you say?

Oh! Let me sing it out so you know: today we not only make up the class, ……

what is more, we will also do a practicum, only this time using the 『咸澤碼訊』

‧ an advanced one

‧ one we have never learned

‧ one you simply cannot talk sense into

It is not the □派○生 kind, you know!!

Or would you rather just have a cup of coffee??

[Images: BabyTux, graphics-tux, sonictux, Starbucks Tux Linux art]

─── Excerpted from 《M♪o 之學習筆記本》, volume 《子》, Switches: 【䷝】 State Encoding

 

Making up class is not easy, adding class is harder still, all because the schoolhouse is no fun.

The craft of teaching and of resolving doubts is bitterly lost; the Way is locked in a high tower, hard to bring back.

Reading has always been one's own affair; the field of learning is tilled by one's own hand. To trace where the "cross-entropy" comes from??? Shuttling and searching through a sea of words, we at last arrive at the following:

Using the cross-entropy to classify MNIST digits

The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation. We'll do that later in the chapter, developing an improved version of our earlier program for classifying the MNIST handwritten digits, network.py. The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter (the code is available on GitHub). For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with 30 hidden neurons, and we'll use a mini-batch size of 10. We set the learning rate to \eta = 0.5*

*In Chapter 1 we used the quadratic cost and a learning rate of \eta = 3.0. As discussed above, it’s not possible to say precisely what it means to use the “same” learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices.

There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra \sigma' = \sigma(1-\sigma) term in them. Suppose we average this over values for \sigma, \int_0^1 d\sigma \sigma(1-\sigma) = 1/6. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn’t be taken too seriously. Still, it can sometimes be a useful starting point.

and we train for 30 epochs. The interface to network2.py is slightly different from that of network.py, but it should still be clear what is going on. You can, by the way, get documentation about network2.py's interface by using commands such as help(network2.Network.SGD) in a Python shell.

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)

Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we’ll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.
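
Incidentally, the rough factor-of-six heuristic from the footnote above is easy to check numerically. The following is a minimal sketch (not part of network2.py; the variable names are my own) that approximates \int_0^1 \sigma(1-\sigma)\, d\sigma by quadrature and relates the Chapter 1 learning rate to the one used here:

import numpy as np

# Average of sigma*(1 - sigma) for sigma uniform on [0, 1].
# Analytically this is 1/2 - 1/3 = 1/6, motivating the rule of thumb of
# dividing the quadratic-cost learning rate by about 6.
sigma = np.linspace(0.0, 1.0, 100001)
average = np.trapz(sigma * (1.0 - sigma), sigma)
print(average)                  # ~0.1667, i.e. about 1/6

eta_quadratic = 3.0             # learning rate used with the quadratic cost in Chapter 1
print(eta_quadratic * average)  # ~0.5, the learning rate used above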

……

What does the cross-entropy mean? Where does it come from?

Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That’s useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?

Let’s begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we’d discovered the learning slowdown described earlier, and understood that the origin was the \sigma'(z) terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it’s possible to choose a cost function so that the \sigma'(z) term disappeared. In that case, the cost C = C_x for a single training example x would satisfy

\frac{\partial C}{\partial w_j} = x_j (a-y), \ \ \ \ (71)

\frac{\partial C}{\partial b} = (a-y). \ \ \ \ (72)

If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They’d also eliminate the problem of a learning slowdown. In fact, starting from these equations we’ll now show that it’s possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have

\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z). \ \ \ \ (73)

Using \sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a) the last equation becomes

\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a). \ \ \ \ (74)

Comparing to Equation (72) we obtain

\frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}. \ \ \ \ (75)

Integrating this expression with respect to a gives

C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \ \ \ \ (76)

for some constant of integration. This is the contribution to the cost from a single training example, x. To get the full cost function we must average over training examples, obtaining

C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \ \ \ \ (77)

where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn’t something that was miraculously pulled out of thin air. Rather, it’s something that we could have discovered in a simple and natural way.
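
To make the integration step from (75) to (76) explicit, one can use a partial-fraction decomposition:

\frac{a-y}{a(1-a)} = -\frac{y}{a} + \frac{1-y}{1-a},

so that

C = \int \frac{a-y}{a(1-a)} \, da = -y \ln a - (1-y) \ln (1-a) + {\rm constant} = -[y \ln a + (1-y) \ln (1-a)] + {\rm constant}.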

What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function x \rightarrow y = y(x). But instead it computes the function x \rightarrow a = a(x). Suppose we think of a as our neuron’s estimated probability that y is 1, and 1-a is the estimated probability that the right value for y is 0. Then the cross-entropy measures how “surprised” we are, on average, when we learn the true value for y. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven’t said exactly what “surprise” means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don’t know of a good, short, self-contained discussion of this subject that’s available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.
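
As a small numerical illustration of this "surprise" reading (a sketch, with a toy function name of my own): treating a as the neuron's estimated probability that y = 1, the per-example cross-entropy is simply the surprisal, -\ln of the probability assigned to the outcome that actually occurred.

import numpy as np

def surprise(a, y):
    # Per-example cross-entropy: -ln of the probability the neuron
    # assigned to the true label y, with a = estimated P(y = 1).
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

print(surprise(0.98, 1))   # confident and correct: low surprise (~0.02 nats)
print(surprise(0.02, 1))   # confident and wrong: high surprise (~3.9 nats)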

──

 

To unravel the mystery, one must further grasp the essence of "relative entropy"!!!

Kullback–Leibler divergence

Definition

For discrete probability distributions P and Q, the Kullback–Leibler divergence of Q from P is defined[5] to be

D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \log\frac{P(i)}{Q(i)}.

In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. The Kullback–Leibler divergence is defined only if Q(i)=0 implies P(i)=0, for all i (absolute continuity). Whenever P(i) is zero the contribution of the i-th term is interpreted as zero because \lim_{x \to 0} x \log(x) = 0.
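
A direct transcription of this definition into code might look like the following sketch (the function name and the optional log base are my own additions; base 2 gives bits, base e gives nats, as noted further below):

import numpy as np

def kl_divergence(p, q, base=np.e):
    # D_KL(P || Q) for discrete distributions given as probability arrays.
    # Terms with P(i) = 0 contribute zero; Q(i) = 0 where P(i) > 0 violates
    # absolute continuity, so the divergence is reported as infinite.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return np.sum(p[support] * np.log(p[support] / q[support])) / np.log(base)

p = [0.36, 0.48, 0.16]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))          # in nats
print(kl_divergence(p, q, base=2))  # in bits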

For distributions P and Q of a continuous random variable, the Kullback–Leibler divergence is defined to be the integral:[6]

D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(x) \, \log\frac{p(x)}{q(x)} \, {\rm d}x, \!

where p and q denote the densities of P and Q.

More generally, if P and Q are probability measures over a set X, and P is absolutely continuous with respect to Q, then the Kullback–Leibler divergence from P to Q is defined as

 D_{\mathrm{KL}}(P\|Q) = \int_X \log\frac{{\rm d}P}{{\rm d}Q} \, {\rm d}P, \!

where \frac{{\rm d}P}{{\rm d}Q} is the Radon–Nikodym derivative of P with respect to Q, and provided the expression on the right-hand side exists. Equivalently, this can be written as

 D_{\mathrm{KL}}(P\|Q) = \int_X \log\!\left(\frac{{\rm d}P}{{\rm d}Q}\right) \frac{{\rm d}P}{{\rm d}Q} \, {\rm d}Q,

which we recognize as the entropy of P relative to Q. Continuing in this case, if \mu is any measure on X for which p = \frac{{\rm d}P}{{\rm d}\mu} and q = \frac{{\rm d}Q}{{\rm d}\mu} exist (meaning that p and q are absolutely continuous with respect to \mu), then the Kullback–Leibler divergence from P to Q is given as

 D_{\mathrm{KL}}(P\|Q) = \int_X p \, \log \frac{p}{q} \, {\rm d}\mu. \!

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm.

Various conventions exist for referring to D_{\mathrm{KL}}(P\|Q) in words. Often it is referred to as the divergence between P and Q; however, this fails to convey the fundamental asymmetry in the relation. Sometimes it may be found described as the divergence of P from, or with respect to, Q (often in the context of relative entropy, or information gain). However, in the present article the divergence of Q from P will be the language used, as this best relates to the idea that it is P that is considered the underlying "true" or "best guess" distribution, that expectations will be calculated with reference to, while Q is some divergent, less good, approximate distribution.

……

Motivation

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability distribution q(x_i) = 2^{-\ell_i} over X, where \ell_i is the length of the code for x_i in bits. Therefore, the Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P.

D_{\mathrm{KL}}(P\|Q) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x) = H(P,Q) - H(P),

where H(P,Q) is the cross entropy of P and Q, and H(P) is the entropy of P.
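
This identity can be checked directly on a small discrete example (a sketch with made-up distributions, using natural logarithms):

import numpy as np

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.30, 0.50, 0.20])

cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
entropy       = -np.sum(p * np.log(p))   # H(P)
kl            =  np.sum(p * np.log(p / q))

print(kl, cross_entropy - entropy)       # the two values coincide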

Note also that there is a relation between the Kullback–Leibler divergence and the “rate function” in the theory of large deviations.[9][10]

[Figure: KL-Gauss-Example] Illustration of the Kullback–Leibler (KL) divergence for two normal (Gaussian) distributions. The typical asymmetry of the Kullback–Leibler divergence is clearly visible.
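
For two univariate Gaussians the divergence has a well-known closed form, D_{\mathrm{KL}}(\mathcal{N}(\mu_1,\sigma_1^2)\|\mathcal{N}(\mu_2,\sigma_2^2)) = \ln(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}, which makes the asymmetry in the figure easy to reproduce; here is a sketch with arbitrary example parameters:

import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    # Closed-form D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats.
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
            - 0.5)

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # D_KL(P || Q) ~ 0.44 nats
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # D_KL(Q || P) ~ 1.31 nats: not symmetric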

……

Discrimination information

The Kullback–Leibler divergence D_{\mathrm{KL}}(p(x|H_1) \| p(x|H_0)) can also be interpreted as the expected discrimination information for H_1 over H_0: the mean information per sample for discriminating in favor of a hypothesis H_1 against a hypothesis H_0, when hypothesis H_1 is true.[16] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for H_1 over H_0 to be expected from each sample.

The expected weight of evidence for H_1 over H_0 is not the same as the information gain expected per sample about the probability distribution p(H) of the hypotheses,

D_\mathrm{KL}( p(x|H_1) \| p(x|H_0) ) \neq IG = D_\mathrm{KL}( p(H|x) \| p(H|I) ).

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution f_0 as possible, so that the new data produces as small an information gain D_{\mathrm{KL}}(f \| f_0) as possible.

For example, if one had a prior distribution p(x,a) over x and a, and subsequently learnt the true distribution of a was u(a), the Kullback–Leibler divergence between the new joint distribution for x and a, q(x|a) u(a), and the earlier prior distribution would be:

D_\mathrm{KL}(q(x|a)u(a)\|p(x,a)) = \operatorname{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)\|p(x|a))\} + D_\mathrm{KL}(u(a)\|p(a)),

i.e. the sum of the Kullback–Leibler divergence of p(a), the prior distribution for a, from the updated distribution u(a), plus the expected value (using the probability distribution u(a)) of the Kullback–Leibler divergence of the prior conditional distribution p(x|a) from the new conditional distribution q(x|a). (Note that the latter expected value is often called the conditional Kullback–Leibler divergence (or conditional relative entropy) and denoted D_{\mathrm{KL}}(q(x|a)\|p(x|a)).[17]) This is minimized if q(x|a) = p(x|a) over the whole support of u(a); and we note that this result incorporates Bayes' theorem, if the new distribution u(a) is in fact a δ function representing certainty that a has one particular value.
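
The decomposition can be verified numerically on a toy discrete example; in the sketch below every distribution is invented purely for illustration:

import numpy as np

def kl(p, q):
    # Discrete D_KL(p || q); assumes all entries are strictly positive.
    p, q = np.ravel(p), np.ravel(q)
    return np.sum(p * np.log(p / q))

# Prior joint p(x, a): rows indexed by a, columns by x (made-up numbers).
p_joint = np.array([[0.10, 0.30],
                    [0.20, 0.40]])
p_a = p_joint.sum(axis=1)              # prior marginal p(a)
p_x_given_a = p_joint / p_a[:, None]   # prior conditional p(x | a)

u_a = np.array([0.7, 0.3])             # newly learnt distribution of a
q_x_given_a = np.array([[0.5, 0.5],    # new conditional q(x | a)
                        [0.1, 0.9]])
q_joint = q_x_given_a * u_a[:, None]   # new joint q(x | a) u(a)

lhs = kl(q_joint, p_joint)
rhs = (sum(u_a[a] * kl(q_x_given_a[a], p_x_given_a[a]) for a in range(2))
       + kl(u_a, p_a))
print(lhs, rhs)                        # both sides agree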

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E. T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the Kullback–Leibler divergence continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE), or Minxent for short. Minimizing the Kullback–Leibler divergence of m from p with respect to m is equivalent to minimizing the cross-entropy of p and m, since

H(p,m) = H(p) + D_{\mathrm{KL}}(p\|m),

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimize by minimizing D_{\mathrm{KL}}(p\|m) subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_{\mathrm{KL}}(p\|m), rather than H(p,m).

……

Relationship to available work

Surprisals[18] add where probabilities multiply. The surprisal for an event of probability p is defined as s=k \ln(1 / p). If k is \{ 1, 1/\ln 2, 1.38\times 10^{-23}\} then surprisal is in \{nats, bits, or J/K\} so that, for instance, there are N bits of surprisal for landing all “heads” on a toss of N coins.
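
A literal transcription of this definition (a sketch; the helper name is my own):

import numpy as np

def surprisal(p, k=1.0):
    # s = k * ln(1/p): k = 1 gives nats, k = 1/ln 2 gives bits,
    # k = 1.38e-23 (Boltzmann's constant) gives J/K.
    return k * np.log(1.0 / p)

N = 10
print(surprisal(0.5**N, k=1.0 / np.log(2)))   # N = 10 bits for all heads on 10 coin tosses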

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S (entropy) for a given set of control parameters (like pressure P or volume V). This constrained entropy maximization, both classically[19] and quantum mechanically,[20] minimizes Gibbs availability in entropy units[21] A\equiv -k \ln Z where Z is a constrained multiplicity or partition function.

When temperature T is fixed, free energy (T \times A) is also minimized. Thus if T, V and number of molecules N are constant, the Helmholtz free energy F\equiv U-TS (where U is energy) is minimized as a system “equilibrates.” If T and P are held constant (say during processes in your body), the Gibbs free energy G=U+PV-TS is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature T_o and pressure P_o is W = \Delta G =NkT_o \Theta(V/V_o) where V_o = NkT_o/P_o and \Theta(x)=x-1-\ln x\ge 0 (see also Gibbs inequality).

More generally[22] the work available relative to some ambient is obtained by multiplying the ambient temperature T_o by the Kullback–Leibler divergence or net surprisal \Delta I\ge 0, defined as the average value of k\ln(p/p_o), where p_o is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of V_o and T_o is thus W = T_o \Delta I, where the Kullback–Leibler divergence is \Delta I = Nk[\Theta(V/V_o)+\frac{3}{2}\Theta(T/T_o)]. The resulting contours of constant Kullback–Leibler divergence, shown in the figure below for a mole of argon at standard temperature and pressure, put limits on the conversion of hot to cold, as in flame-powered air-conditioning or in the unpowered device for converting boiling water to ice water discussed in [23]. Thus the Kullback–Leibler divergence measures thermodynamic availability in bits.

[Figure: ArgonKLdivergence] Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as T_o times the Kullback–Leibler divergence.
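
The availability formulas above are straightforward to evaluate; the following sketch (with illustrative, not authoritative, parameter values) computes W = T_o \Delta I for one mole of a monatomic ideal gas such as argon:

import numpy as np

k_B = 1.380649e-23        # Boltzmann constant, J/K
N_A = 6.02214076e23       # Avogadro's number (one mole of atoms)

def Theta(x):
    # Theta(x) = x - 1 - ln x >= 0, with equality only at x = 1 (Gibbs inequality).
    return x - 1.0 - np.log(x)

def available_work(T, V, T0, V0, N=N_A):
    # W = T0 * DeltaI, where DeltaI = N k [Theta(V/V0) + (3/2) Theta(T/T0)]
    # is the Kullback-Leibler divergence (net surprisal) relative to ambient T0, V0.
    delta_I = N * k_B * (Theta(V / V0) + 1.5 * Theta(T / T0))
    return T0 * delta_I

# Example: gas at twice the ambient temperature, already at ambient volume.
print(available_work(T=600.0, V=0.0246, T0=300.0, V0=0.0246), "J")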

───