W!o+'s 《小伶鼬工坊演義》: Neural Networks【backpropagation】II

Why does Michael Nielsen say that the "notation" he uses looks a little "quirky" at first glance? And why, after living with it for a while, does it come to feel "natural" instead?

The following diagram shows examples of these notations in use:

With these notations, the activation a^l_j of the j^{\rm th} neuron in the l^{\rm th} layer is related to the activations in the (l-1)^{\rm th} layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)

a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \ \ \ \ (23)

where the sum is over all neurons k in the (l-1)^{\rm th} layer. To rewrite this expression in a matrix form we define a weight matrix w^l for each layer, l. The entries of the weight matrix w^l are just the weights connecting to the l^{\rm th} layer of neurons, that is, the entry in the j^{\rm th} row and k^{\rm th} column is w^l_{jk}. Similarly, for each layer l we define a bias vector, b^l. You can probably guess how this works – the components of the bias vector are just the values b^l_j, one component for each neuron in the l^{\rm th} layer. And finally, we define an activation vector a^l whose components are the activations a^l_j.
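To make the index convention concrete, here is a minimal NumPy sketch (illustrative names only, not the book's code) that evaluates Equation (23) neuron by neuron, summing over k for each output neuron j:

```python
import numpy as np

def sigmoid(z):
    # The logistic function sigma(z) = 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation_componentwise(w_l, b_l, a_prev):
    """Equation (23), one neuron at a time.

    w_l[j][k] is the weight from neuron k in layer l-1 to neuron j in
    layer l (the book's convention), b_l[j] is the bias of neuron j,
    and a_prev[k] is the activation of neuron k in layer l-1.
    """
    a_l = np.zeros(len(b_l))
    for j in range(len(b_l)):
        z_j = sum(w_l[j][k] * a_prev[k] for k in range(len(a_prev))) + b_l[j]
        a_l[j] = sigmoid(z_j)
    return a_l
```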

 

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as \sigma. We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as \sigma to every element in a vector v. We use the obvious notation \sigma (v) to denote this kind of elementwise application of a function. That is, the components of \sigma (v) are just {\sigma (v)}_j = \sigma (v_j). As an example, if we have the function f(x) = x^2 then the vectorized form of f has the effect

f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right) = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right] = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right], \ \ \ \ (24)

that is, the vectorized f just squares every element of the vector.
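For what it's worth, NumPy gives this elementwise behaviour for free: an ordinary Python function built from arithmetic operators broadcasts over arrays, so the example in (24) can be checked directly (a tiny sketch, not from the book):

```python
import numpy as np

def f(x):
    # f(x) = x^2; applied to an array, NumPy squares each element.
    return x ** 2

v = np.array([2.0, 3.0])
print(f(v))   # [4. 9.]  -- the vectorized f squares every element
```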

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form

a^{l} = \sigma(w^l a^{l-1}+b^l). \ \ \ \ (25)

This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the \sigma function*.

*By the way, it’s this expression that motivates the quirk in the w^l_{jk} notation mentioned earlier. If we used j to index the input neuron, and k to index the output neuron, then we’d need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That’s a small change, but annoying, and we’d lose the easy simplicity of saying (and thinking) “apply the weight matrix to the activations”.

That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we’ve taken to now. Think of it as a way of escaping index hell, while remaining precise about what’s going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.

When using Equation (25) to compute a^l, we compute the intermediate quantity z^l \equiv w^l a^{l-1}+b^l along the way. This quantity turns out to be useful enough to be worth naming: we call z^l the weighted input to the neurons in layer l. We’ll make considerable use of the weighted input z^l later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as a^l = \sigma(z^l). It’s also worth noting that z^l has components z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j, that is, z^l_j  is just the weighted input to the activation function for neuron j in layer l.
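Putting the pieces together, here is a minimal sketch of one feedforward step, in the spirit of (though not identical to) the network code from the last chapter, computing the weighted input z^l and then the activation a^l:

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_step(w_l, b_l, a_prev):
    """Return (z^l, a^l) for one layer.

    Shapes: w_l is (n_l, n_{l-1}) with w_l[j, k] = w^l_{jk};
    b_l is (n_l, 1); a_prev is (n_{l-1}, 1).
    """
    z_l = np.dot(w_l, a_prev) + b_l   # z^l = w^l a^{l-1} + b^l
    a_l = sigmoid(z_l)                # a^l = sigma(z^l), Equation (25)
    return z_l, a_l
```

Keeping z^l around rather than recomputing it is convenient precisely because, as the text notes, the weighted input will be reused heavily later in the chapter.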

───

 

Imagine that the diagram of the network's "input to output" flow were drawn not "from left to right" but "from top to bottom":

[Figure: the same network drawn top-down (neural-top-down)]

 

How would people then picture w^l_{jk}? And had the convention for a network's "input/output" been "from right to left" from the start, wouldn't the notation look even more natural? In the usual "matrix" notation the "input" sits on the right and the "output" on the left, so intuition senses a "conflict", and that is probably why the notation feels "quirky"!

In "mathematical physics" or "engineering mathematics" textbooks, a matrix M is usually written with a capital letter, and a vector v may carry an arrow, \vec v. Then M_{jk} is the entry of the matrix in the j-th row and k-th column, and v_i is the i-th component of the column vector \vec v.

If, following this convention, we rewrite Equation (25) as:

{\vec a}^{\ l} \ = \ \sigma( W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l})

does it not express more clearly the relation between the activation {\vec a}^{\ l} of layer l and the activation {\vec a}^{\ l-1} of the previous layer l-1? On this reading, understanding a "vector function" \vec y \ = \ f ( \vec x) as y_i \ = \ f(x_i) also follows quite naturally.
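The footnote's point about the transpose is easy to check numerically. In this small sketch (illustrative names, assuming the shapes used above), storing the weights with j indexing the input neuron instead of the output neuron forces a transpose before Equation (25) can be applied:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 1))       # activations of layer l-1
b_l = rng.standard_normal((3, 1))          # biases of layer l

W_book = rng.standard_normal((3, 4))       # W_book[j, k]: j indexes the *output* neuron
W_swapped = W_book.T                       # alternative convention: j indexes the *input* neuron

a_book = sigmoid(W_book @ a_prev + b_l)          # Equation (25) as written
a_swapped = sigmoid(W_swapped.T @ a_prev + b_l)  # the swapped convention needs a transpose

print(np.allclose(a_book, a_swapped))      # True
```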

Do not underestimate the importance of "notation" and of "representation": they are often the very gateway to a deeper understanding. Hopefully the short passage below is enough to draw you in:

\bigodot | O  I  A  B
---------+-----------
    O    | O  O  O  O
    I    | O  I  A  B
    A    | O  A  B  I
    B    | O  B  I  A

\bigoplus | O  I  A  B
----------+-----------
    O     | O  I  A  B
    I     | I  O  B  A
    A     | A  B  O  I
    B     | B  A  I  O

[Images: a Rubik's cube, the cube roots of −1, the clock (mod 12) group, a cyclic group, the sixteenth stellation of the icosahedron]

How, then, should one "understand" the "addition table" and "multiplication table" above? People naturally tend to read \bigoplus as "addition" and \bigodot as "multiplication". Yet the general "abstract structures" of mathematics are defined by "rules": they describe the "properties" possessed by the "elements" of some "set" and the "laws" satisfied by the "operations". This has nothing to do with whether a familiar analogous structure exists, and those "elements" need not be "numbers" at all! Perhaps that is why "abstract mathematics" is felt to be "difficult". Pure "logical reasoning" can certainly reach the "conclusions", but without some grounding in "experience" people usually "feel" the results are not real, not concrete, and not reassuring. So let us try to give this structure a "representation" that is easier to "understand": imagine placing O, I, A, B on the "complex plane", with O at the "origin" and I, A, B on the "unit circle", defined as follows:
O \equiv_{rp} \ 0 + i 0 = 0
I  \equiv_{rp} \ 1 + i 0 = 1
A \equiv_{rp} \ - \frac{1}{2} + i \frac{\sqrt{3}}{2}
B \equiv_{rp} \ - \frac{1}{2} - i \frac{\sqrt{3}}{2}

X \bigoplus Y  \equiv_{rp} \ -(X + Y), if X \neq Y
X \bigoplus Y  \equiv_{rp} \ (X - Y) = 0, if X = Y
X \bigodot Y \equiv_{rp} \  X \cdot Y

\because I \bigoplus A  \equiv_{rp} \ - \left[ 1 + \left( - \frac{1}{2} + i \frac{\sqrt{3}}{2} \right) \right]
= - \left( \frac{1}{2} + i \frac{\sqrt{3}}{2} \right) = B

A \bigodot A \equiv_{rp}  \ {\left( - \frac{1}{2} + i \frac{\sqrt{3}}{2} \right)}^2
= \frac{1}{4} - i \frac{\sqrt{3}}{2} - \frac{3}{4} = B

\therefore (I \bigoplus A) \bigoplus B = B \bigoplus B
= I \bigoplus A \bigoplus (A \bigodot A) = O
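The worked identities above are easy to verify numerically. Below is a small sketch (an illustration added here, not part of the quoted post) that encodes O, I, A, B as complex numbers and checks I \bigoplus A = B, A \bigodot A = B, (I \bigoplus A) \bigoplus B = O, and the relation 1 + A + A^2 = 0 used just below:

```python
import cmath
import math

# Complex-plane representation of the four elements
O = 0 + 0j
I = 1 + 0j
A = complex(-0.5,  math.sqrt(3) / 2)
B = complex(-0.5, -math.sqrt(3) / 2)

def oplus(x, y):
    # X (+) Y : 0 if X = Y, otherwise -(X + Y).
    return 0j if cmath.isclose(x, y) else -(x + y)

def odot(x, y):
    # X (.) Y : ordinary complex multiplication.
    return x * y

assert cmath.isclose(oplus(I, A), B)        # I (+) A = B
assert cmath.isclose(odot(A, A), B)         # A (.) A = B
assert abs(oplus(oplus(I, A), B)) < 1e-12   # (I (+) A) (+) B = O
assert abs(1 + A + A * A) < 1e-12           # 1 + A + A^2 = 0
print("all identities check out")
```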

Rewriting the identity above in complex numbers as 1 + A + A^2 = 0, is this not exactly the "\omega" of the "cubic equation" x^3 - 1 = (x - 1)(x^2 + x + 1) = 0 from the post before last? Appeal further to the "vector addition" and "rotational multiplication" of phasors, and the intended meaning of this four-element structure can perhaps be "imagined". If one "practises" "abstract thinking" again and again, then "logical deduction" too becomes part of one's "experience"! As the saying goes, practice makes perfect; the "abstract" becomes the "intuitive"!!

─── Excerpted from 《【Sonic π】電路學之補充《四》無窮小算術‧下上》