W!o+'s Romance of the Little Weasel Workshop (《小伶鼬工坊演義》): Neural Networks [Random Variables] I

Guoyu, "Discourses of Zheng" (國語‧鄭語)

Duke Huan, as Minister over the Masses (司徒), had greatly won over the people of Zhou and of the eastern lands. He asked Shi Bo: "The royal house is beset by troubles, and I fear being caught up in them. Where might I flee to escape ruin?" Shi Bo replied: "The royal house will sink low, and the Rong and Di will surely flourish; they cannot be pressed upon. Around Chengzhou, to the south lie Jing (Chu), the Man, Shen, Lü, Ying, Deng, Chen, Cai, Sui, and Tang; to the north lie Wei, Yan, the Di, Xianyu, Lu, Luo, Quan, Xu, and Pu; to the west lie Yu, Guo, Jin, Wei (隗), Huo, Yang, Wei (魏), and Rui; to the east lie Qi, Lu, Cao, Song, Teng, Xue, Zou, and Ju. These are either the cadet sons, younger brothers, nephews, or maternal kin of the king, or else peoples of the Man, Jing, Rong, and Di; where they are not kin they are intractable, and none of them may be entered. Perhaps, then, the land between the Ji, Luo, He, and Ying rivers! Among the viscount and baron states there, Guo and Kuai are the greatest. The Earl of Guo presumes upon his standing and the lord of Kuai upon his strongholds; both are arrogant, extravagant, indolent, and negligent, and greedy besides. If, pleading the troubles of Zhou, you lodge your family and treasure with them, they will not dare refuse. When Zhou falls into disorder and decay, they, being arrogant and greedy, will surely turn against you; and if you then lead the host of Chengzhou to proclaim their crimes and chastise them, you cannot fail to prevail. Once you take those two cities, then Wu, Bi, Bu, Zhou, Yi, Rou, Li, and Hua will all be your territory. With Hua before you and the He behind, the Luo on your right and the Ji on your left, taking Mounts Fu and Gui as your anchor and drawing sustenance from the Zhen and Wei rivers, and upholding the established laws to guard it all, you may thereby secure yourself in some small measure."

The duke asked: "Would the south not do?" He replied: "Xiong Yan, the viscount of Jing (Chu), had four sons: Boshuang, Zhongxue, Shuxiong, and Jixun. Shuxiong fled from trouble to Pu and took up the ways of the Man, and Jixun was set upon the throne; the Wei (薳) clan meant to raise him up, and calamity again could not prevail against him. This is Heaven opening their hearts. He is, moreover, exceedingly perceptive and conciliatory, surpassing even his former kings. I have heard that what Heaven has opened is not displaced for ten generations. His descendants will surely spread forth and open up new lands; they cannot be pressed upon. They are, besides, the posterity of Chong and Li. Li served Gaoxin as Regulator of Fire (火正), and with pure radiance and generous greatness made manifest the brightness of Heaven and the virtue of Earth, his light shining over the four seas; hence he was named 'Zhurong'. Great indeed was his merit."

 

What is one to do when, in reading, every single character is familiar and yet what each one refers to remains unclear?? Mr. Michael Nielsen presumably assumes his readers know who "Gauss" was! what a "Gaussian function" refers to! and that they already understand the "normal distribution"!! Only then could the text proceed so matter-of-factly, handling weighty matters with so light a touch??!!

Weight initialization

When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we’ve been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1. While this approach has worked well, it was quite ad hoc, and it’s worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
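In code, that prescription amounts to something like the following sketch (a minimal NumPy version in the spirit of the book's network.py; the function name old_initializer and its exact structure are illustrative assumptions, not the book's verbatim code):

import numpy as np

def old_initializer(sizes):
    """Weights and biases drawn as independent N(0, 1) Gaussians.
    `sizes` lists the layer sizes, e.g. [784, 30, 10]."""
    biases = [np.random.randn(y, 1) for y in sizes[1:]]
    weights = [np.random.randn(y, x)
               for x, y in zip(sizes[:-1], sizes[1:])]
    return biases, weights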

It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we’re working with a network with a large number – say 1,000 – of input neurons. And let’s suppose we’ve used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I’m going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:

We’ll suppose for simplicity that we’re trying to train using a training input x in which half the input neurons are on, i.e., set to 1, and half the input neurons are off, i.e., set to 0. The argument which follows applies more generally, but you’ll get the gist from this special case. Let’s consider the weighted sum z = \sum_j w_j x_j+b of inputs to our hidden neuron. 500 terms in this sum vanish, because the corresponding input x_j is zero. And so z is a sum over a total of 501 normalized Gaussian random variables, accounting for the 500 weight terms and the 1 extra bias term. Thus z is itself distributed as a Gaussian with mean zero and standard deviation \sqrt{501} \approx 22.4. That is, z has a very broad Gaussian distribution, not sharply peaked at all:

[Figure Gaussian-1: the broad Gaussian distribution of z, with mean 0 and standard deviation \sqrt{501} \approx 22.4.]
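That \sqrt{501} \approx 22.4 figure is easy to confirm by simulation; here is a small Monte Carlo sketch (assuming NumPy; 1,000 inputs with half of them set to 1):

import numpy as np

rng = np.random.default_rng(0)
n_in, trials = 1000, 10_000

x = np.zeros(n_in)
x[:500] = 1.0                             # half the inputs on, half off

w = rng.standard_normal((trials, n_in))   # fresh N(0, 1) weights per trial
b = rng.standard_normal(trials)           # N(0, 1) bias per trial
z = w @ x + b

print(z.std())                            # ~ sqrt(501) ≈ 22.4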

In particular, we can see from this graph that it's quite likely that |z| will be pretty large, i.e., either z \gg 1 or z \ll -1. If that's the case then the output \sigma(z) from the hidden neuron will be very close to either 1 or 0. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely minuscule changes in the activation of our hidden neuron. That minuscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly minuscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm*

*We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.

. It’s similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
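To make the saturation concrete, here is a small sketch (using the standard sigmoid; this is illustrative, not code from the book) showing how flat \sigma becomes once |z| is large:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 10.0, 22.4]:
    print(f"z = {z:5.1f}   sigma(z) = {sigmoid(z):.8f}   sigma'(z) = {sigmoid_prime(z):.2e}")

# At z = 22.4 the derivative is about 2e-10, so changing the incoming
# weights barely moves the neuron's activation: this is the slowdown.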

I’ve been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to 0 or 1, and learning will proceed very slowly.

Is there some way we can choose better initializations for the weights and biases, so that we don’t get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with n_{\rm in} input weights. Then we shall initialize those weights as Gaussian random variables with mean 0 and standard deviation 1/\sqrt{n_{\rm in}}. That is, we’ll squash the Gaussians down, making it less likely that our neuron will saturate. We’ll continue to choose the bias as a Gaussian with mean 0 and standard deviation 1, for reasons I’ll return to in a moment. With these choices, the weighted sum z = \sum_j w_j x_j + b will again be a Gaussian random variable with mean 0, but it’ll be much more sharply peaked than it was before. Suppose, as we did earlier, that 500 of the inputs are zero and 500 are 1. Then it’s easy to show (see the exercise below) that z has a Gaussian distribution with mean 0 and standard deviation \sqrt{3/2} = 1.22\ldots. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I’ve had to rescale the vertical axis, when compared to the earlier graph:

[Figure Gaussian-2: the much more sharply peaked Gaussian distribution of z, with mean 0 and standard deviation \sqrt{3/2}.]

Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
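As a sketch, the new prescription changes only the scale of the weight draws; something like this (modelled loosely on the initializer described above, with an assumed function name):

import numpy as np

def new_initializer(sizes):
    """Weights ~ N(0, 1/n_in) for a neuron with n_in inputs; biases stay N(0, 1)."""
    biases = [np.random.randn(y, 1) for y in sizes[1:]]
    weights = [np.random.randn(y, x) / np.sqrt(x)
               for x, y in zip(sizes[:-1], sizes[1:])]
    return biases, weights

With 1,000 inputs, half of them set to 1, the weighted sum z then has standard deviation \sqrt{500 \cdot \tfrac{1}{1000} + 1} = \sqrt{3/2}, matching the figure quoted above.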

Exercise

  • Verify that the standard deviation of z = \sum_j w_j x_j + b in the paragraph above is \sqrt{3/2}. It may help to know that: (a) the variance of a sum of independent random variables is the sum of the variances of the individual random variables; and (b) the variance is the square of the standard deviation.

I stated above that we’ll continue to initialize the biases as before, as Gaussian random variables with a mean of 0 and a standard deviation of 1. This is okay, because it doesn’t make it too much more likely that our neurons will saturate. In fact, it doesn’t much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to 0, and rely on gradient descent to learn appropriate biases. But since it’s unlikely to make much difference, we’ll continue with the same initialization procedure as before.

───

 

Even if one can solve the exercise as posed,

\sigma^2 (z) = \sum_j \sigma^2 (w_j) x_j + \sigma^2 (b) = \frac{500}{1000} + 1 = \frac{3}{2}

one may well still be none the wiser!!??
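For completeness, the same computation expanded step by step (a worked version, assuming the w_j and b are mutually independent):

\mathrm{Var}(z) = \mathrm{Var}\Big(\sum_j w_j x_j + b\Big) = \sum_j x_j^2 \,\mathrm{Var}(w_j) + \mathrm{Var}(b) = 500 \cdot \frac{1}{1000} + 1 = \frac{3}{2}, \qquad \sigma(z) = \sqrt{3/2} \approx 1.22 .

(Here x_j^2 = x_j because each input is 0 or 1, and each weight has variance 1/n_{\rm in} = 1/1000 under the new scheme.)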

Here \sigma^2 is the "variance" of a "random variable"; it determines the spread of the probability distribution. As for the "Gaussian distribution", also known as the "normal distribution", the Wikipedia entry puts it this way:

Normal distribution

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[1][2]

The normal distribution is useful because of the central limit theorem. In its most general form, under some conditions (which include finite variance), it states that averages of random variables independently drawn from independent distributions converge in distribution to the normal, that is, become normally distributed when the number of random variables is sufficiently large. Physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are nearly normal.[3] Moreover, many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed.
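A quick numerical illustration of the central limit theorem in that spirit (a sketch assuming NumPy: averages of independent uniform variables, which individually look nothing like a Gaussian, end up approximately normal):

import numpy as np

rng = np.random.default_rng(42)
n, trials = 100, 100_000

# Average n independent Uniform(0, 1) draws, repeated many times.
means = rng.random((trials, n)).mean(axis=1)

# The CLT predicts roughly N(1/2, 1/(12 n)) for the sample mean.
print(means.mean())   # ~ 0.5
print(means.std())    # ~ sqrt(1/1200) ≈ 0.0289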

The normal distribution is sometimes informally called the bell curve. However, many other distributions are bell-shaped (such as the Cauchy, Student’s t, and logistic distributions). The terms Gaussian function and Gaussian bell curve are also ambiguous because they sometimes refer to multiples of the normal distribution that cannot be directly interpreted in terms of probabilities.

The probability density of the normal distribution is

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

where \mu is the mean or expectation of the distribution (and also its median and mode), \sigma is the standard deviation, and \sigma^2 is the variance.

A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.
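As a sanity check on that density formula, a small sketch (assuming NumPy and SciPy are available) that codes the formula directly and compares it against scipy.stats.norm:

import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu=0.0, sigma=1.0):
    """The density exactly as written in the formula above."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(normal_pdf(x), norm.pdf(x)))                                 # True
print(np.allclose(normal_pdf(x, 2.0, 0.5), norm.pdf(x, loc=2.0, scale=0.5)))   # True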

[Figure Normal_Distribution_PDF.svg: probability density functions of several normal distributions; the red curve is the standard normal distribution.]

───

 

For "independent" Gaussian "random variables", the following holds:

Sum of normally distributed random variables

In probability theory, calculation of the sum of normally distributed random variables is an instance of the arithmetic of random variables, which can be quite complex based on the probability distributions of the random variables involved and their relationships.

Independent random variables

If X and Y are independent random variables that are normally distributed (and therefore also jointly so), then their sum is also normally distributed, i.e., if

X \sim N(\mu_X, \sigma_X^2)
Y \sim N(\mu_Y, \sigma_Y^2)
Z = X + Y,

then

Z \sim N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2).

This means that the sum of two independent normally distributed random variables is normal, with its mean being the sum of the two means, and its variance being the sum of the two variances (i.e., the square of the standard deviation is the sum of the squares of the standard deviations).

Note that the result that the sum is normally distributed requires the assumption of independence, not just uncorrelatedness; two separately (not jointly) normally distributed random variables can be uncorrelated without being independent, in which case their sum can be non-normally distributed (see Normally distributed and uncorrelated does not imply independent#A symmetric example). The result about the mean holds in all cases, while the result for the variance requires uncorrelatedness, but not independence.
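That statement is easy to check numerically; a Monte Carlo sketch (assuming NumPy; the particular means and standard deviations are arbitrary choices):

import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

x = rng.normal(1.0, 2.0, n)     # X ~ N(1, 2^2)
y = rng.normal(-3.0, 1.5, n)    # Y ~ N(-3, 1.5^2), independent of X
z = x + y

print(z.mean())   # ~ 1 + (-3) = -2
print(z.var())    # ~ 4 + 2.25 = 6.25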

Proofs

Proof using characteristic functions


The characteristic function

\varphi_{X+Y}(t) = \operatorname{E}\!\left(e^{it(X+Y)}\right)

of the sum of two independent random variables X and Y is just the product of the two separate characteristic functions:

\varphi_X(t) = \operatorname{E}\!\left(e^{itX}\right), \qquad \varphi_Y(t) = \operatorname{E}\!\left(e^{itY}\right)

of X and Y.

The characteristic function of the normal distribution with expected value \mu and variance \sigma^2 is

\varphi(t) = \exp\!\left(it\mu - \frac{\sigma^2 t^2}{2}\right).

So

\varphi_{X+Y}(t) = \varphi_X(t)\,\varphi_Y(t) = \exp\!\left(it\mu_X - \frac{\sigma_X^2 t^2}{2}\right)\exp\!\left(it\mu_Y - \frac{\sigma_Y^2 t^2}{2}\right) = \exp\!\left(it(\mu_X + \mu_Y) - \frac{(\sigma_X^2 + \sigma_Y^2)t^2}{2}\right).

This is the characteristic function of the normal distribution with expected value \mu_X + \mu_Y and variance \sigma_X^2 + \sigma_Y^2.

Finally, recall that no two distinct distributions can both have the same characteristic function, so the distribution of X+Y must be just this normal distribution.
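The characteristic-function identity itself can also be checked numerically: estimate \operatorname{E}(e^{itZ}) by Monte Carlo and compare with the closed form above (a sketch; the parameters and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

mu_x, sigma_x = 1.0, 2.0
mu_y, sigma_y = -3.0, 1.5

z = rng.normal(mu_x, sigma_x, n) + rng.normal(mu_y, sigma_y, n)

for t in (0.1, 0.5, 1.0):
    empirical = np.mean(np.exp(1j * t * z))
    closed_form = np.exp(1j * t * (mu_x + mu_y)
                         - (sigma_x ** 2 + sigma_y ** 2) * t ** 2 / 2.0)
    print(t, abs(empirical - closed_form))   # differences of order 1/sqrt(n)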

───

 

Yet even if every word of this is beyond dispute, it remains so much bookish recitation!!! On the path of learning, how could one swallow it whole without chewing???