W!o+'s 《小伶鼬工坊演義》: Neural Networks [backpropagation] III

At first glance, this passage from Mr. Michael Nielsen:

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives \partial C / \partial w and \partial C / \partial b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it’s useful to have an example cost function in mind. We’ll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form

C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2, \ \ \ \ (26)

where: n is the total number of training examples; the sum is over individual training examples, x; y=y(x) is the corresponding desired output; L denotes the number of layers in the network; and a^L = a^L(x) is the vector of activations output from the network when x is input.

Okay, so what assumptions do we need to make about our cost function, C, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average C = \frac{1}{n} \sum_x C_x over cost functions C_x for individual training examples, x. This is the case for the quadratic cost function, where the cost for a single training example is C_x = \frac{1}{2} \|y-a^L \|^2. This assumption will also hold true for all the other cost functions we’ll meet in this book.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives \partial C_x / \partial w and \partial C_x / \partial b for a single training example. We then recover \partial C / \partial w and \partial C / \partial b by averaging over training examples. In fact, with this assumption in mind, we’ll suppose the training example x has been fixed, and drop the x subscript, writing the cost C_x as C. We’ll eventually put the x back in, but for now it’s a notational nuisance that is better left implicit.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example x may be written as

C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2, \ \ \ \ (27)

and thus is a function of the output activations. Of course, this cost function also depends on the desired output y, and you may wonder why we’re not regarding the cost also as a function of y. Remember, though, that the input training example x is fixed, and so the output y is also a fixed parameter. In particular, it’s not something we can modify by changing the weights and biases in any way, i.e., it’s not something which the neural network learns. And so it makes sense to regard C as a function of the output activations a^L alone, with y merely a parameter that helps define that function.

───

 

Simple and clear! Or is it actually quite a chore to sort through? First, let us rewrite equation (26) as follows:

C = \frac{1}{n} \cdot \left( \frac{1}{2} \sum_x \| \vec y (\vec x)-{\vec a}^L(\vec x)\|^2 \right)

Viewed from a 『black box』 perspective: the 『input』 is \vec x, the network's 『desired output』 is \vec y, and the 『output』 of the current 『training』 pass is {\vec a}^L. For a given 『training example』 \vec x, define the network's 『target error』 as C_x \equiv \frac{1}{2} \| \vec y-{\vec a}^L\|^2. The 『function』 C is then the 『average』 『target error』 over all n 『training examples』. Inspecting the definition C_x = \frac{1}{2} \sum_i (y_i-a^L_i)^2 closely, one finds it is simply half the squared 『Euclidean distance』 between the two vectors, hence a 『scalar』 (a small sketch of this decomposition in code follows the excerpt below). Moreover, the so-called 『training example』 \vec x and the 『desired output』 \vec y are in fact merely 『parameters』:

Parameter

A parameter (from the Ancient Greek παρά, “para”, meaning “beside, subsidiary” and μέτρον, “metron”, meaning “measure”), in its common meaning, is a characteristic, feature, or measurable factor that can help in defining a particular system. A parameter is an important element to consider in evaluation or comprehension of an event, project, or situation. Parameter has more specific interpretations in mathematics, logic, linguistics, environmental science, and other disciplines.[1]

Mathematical functions

Mathematical functions have one or more arguments that are designated in the definition by variables. A function definition can also contain parameters, but unlike variables, parameters are not listed among the arguments that the function takes. When parameters are present, the definition actually defines a whole family of functions, one for every valid set of values of the parameters. For instance, one could define a general quadratic function by declaring

f(x)=ax^2+bx+c;

here, the variable x designates the function’s argument, but a, b, and c are parameters that determine which particular quadratic function is being considered. A parameter could be incorporated into the function name to indicate its dependence on the parameter. For instance, one may define the base b of a logarithm by

\log_b(x)=\frac{\log(x)}{\log(b)}

where b is a parameter that indicates which logarithmic function is being used. It is not an argument of the function, and will, for instance, be a constant when considering the derivative \textstyle\log_b'(x).

In some informal situations it is a matter of convention (or historical accident) whether some or all of the symbols in a function definition are called parameters. However, changing the status of symbols between parameter and variable changes the function as a mathematical object. For instance, the notation for the falling factorial power

n^{\underline k}=n(n-1)(n-2)\cdots(n-k+1),

defines a polynomial function of n (when k is considered a parameter), but is not a polynomial function of k (when n is considered a parameter). Indeed, in the latter case, it is only defined for non-negative integer arguments. More formal presentations of such situations typically start out with a function of several variables (including all those that might sometimes be called “parameters”) such as

(n,k) \mapsto n^{\underline{k}}

as the most fundamental object being considered, then defining functions with fewer variables from the main one by means of currying.

───
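The excerpt above distinguishes a function's parameters from its arguments and closes with currying. As a rough illustration only (the helper names `quadratic` and `log_base` are ours, not Wikipedia's), here is how its two examples look in Python when the parameters are captured by closures, a sketch rather than any standard library facility:

```python
import math

def quadratic(a, b, c):
    """a, b, c play the role of parameters: fixing them picks one member
    of the family f(x) = a*x^2 + b*x + c; x remains the argument."""
    def f(x):
        return a * x ** 2 + b * x + c
    return f

def log_base(b):
    """Currying the two-variable map (b, x) |-> log_b(x): fixing the
    parameter b yields a one-argument logarithm function."""
    def f(x):
        return math.log(x) / math.log(b)
    return f

f = quadratic(1, -3, 2)   # one particular quadratic, x^2 - 3x + 2
log2 = log_base(2)        # the base-2 logarithm
print(f(3), log2(8))      # 2.0  3.0
```

Fixing a, b, c (or the base b) selects one member of the family; the remaining x is the genuine argument.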

 
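Returning to the cost, here is the small sketch promised above: a minimal NumPy rendering (our own, assuming the desired outputs and network outputs are given as plain arrays) of the per-example cost C_x and of C as their average, exactly the decomposition that the first assumption requires:

```python
import numpy as np

def per_example_cost(y, a_L):
    """C_x = 0.5 * ||y - a^L||^2 : half the squared Euclidean distance
    between desired output y and network output a^L, a single scalar."""
    return 0.5 * np.sum((y - a_L) ** 2)

def total_cost(ys, a_Ls):
    """C = (1/n) * sum_x C_x : the average per-example cost over the
    n training examples, as in equation (26)."""
    n = len(ys)
    return sum(per_example_cost(y, a) for y, a in zip(ys, a_Ls)) / n

# two made-up training examples, purely for illustration
ys   = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # desired outputs y(x)
a_Ls = [np.array([0.8, 0.1]), np.array([0.3, 0.6])]   # network outputs a^L(x)
print(total_cost(ys, a_Ls))   # the mean of the two C_x values
```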

The 『training example』 \vec x and the 『desired output』 \vec y are fixed by whichever sample is in hand, and so cannot be changed. The only thing that can change is the network's 『output』 {\vec a}^L, and that is precisely the point of the 『learning algorithm』! Why, then, the coefficient of 『one half』, \frac{1}{2}? Because C_x is a 『quadratic form』: taking the 『derivative』 brings out a 『multiplicative factor』 of 2, so that 2 \cdot \frac{1}{2} = 1. Of course the 『number of examples』 n is also a 『parameter』, so C is a function of {\vec a}^L alone. Next, from the layer-by-layer relation between 『activations』,

{\vec a}^{\ l} \ = \ \sigma( W^{\ l} \ {\vec a}^{\ l-1}  \ + \ {\vec b}^{\ l})

it follows that C can also be expressed as a function of

W^L, W^{L-1}, \cdots , W^l, \cdots , {\vec b}^L , {\vec b}^{\ L-1}, \cdots ,{\vec b}^l , \cdots

in other words, a function of all the weight matrices and bias vectors of the network.
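This dependence can be made concrete with a minimal NumPy sketch (ours; the 2-3-2 shape and the random parameters are arbitrary): feeding x forward through {\vec a}^{\ l} = \sigma(W^{\ l} {\vec a}^{\ l-1} + {\vec b}^{\ l}) shows that every W^l and \vec b^l reaches the cost, and a finite-difference check shows the \frac{1}{2} cancelling the 2 in \partial C_x / \partial a^L_j = a^L_j - y_j:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """a^l = sigma(W^l a^{l-1} + b^l), layer by layer: the final a^L is what
    the cost sees, so C depends on every W^l and b^l along the way."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

def C_x(a_L, y):
    # per-example quadratic cost, C_x = 0.5 * ||y - a^L||^2
    return 0.5 * np.sum((y - a_L) ** 2)

# a hypothetical 2-3-2 network with arbitrary random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
biases  = [rng.standard_normal(3), rng.standard_normal(2)]

x, y = np.array([0.5, -0.2]), np.array([1.0, 0.0])
a_L  = feedforward(x, weights, biases)
print(C_x(a_L, y))          # changing any W^l or b^l changes a_L, hence C_x

# the 1/2 cancels the 2 of the square: d C_x / d a^L_j = a^L_j - y_j
analytic = a_L - y
eps = 1e-6
numeric = np.array([(C_x(a_L + eps * e, y) - C_x(a_L - eps * e, y)) / (2 * eps)
                    for e in np.eye(len(a_L))])
print(analytic, numeric)    # the finite-difference gradient matches a^L - y
```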

Why not take this opportunity to read the 《λ 運算︰……》 series of articles and get a feel for what a 『variable』 (變元) and a 『function』 (函式) really are:

First, readers may wish to consult the 『symbol definitions』 in the article 《Thue 之改寫系統《一》》; here we quote the short passage from that article on the mathematical 『definition of a function』:

So when a mathematician states the definition of a 『function』 f:

Suppose there are two sets S and T, called the 『domain』 and the 『codomain』. A function f is a subset of S \times T satisfying

\forall x \, (x \in S) \ \ \exists ! \, y \ (y \in T \ \wedge \ (x,y) \in f)

written x \mapsto y = f(x), where 『\exists \, !』 means 『there exists exactly one』; nothing strange about that. Likewise, if a 『binary operation』 is 『abbreviated』 as X \times Y \mapsto_{\bigoplus} Z with X=Y=Z=S, it says:

z = \bigoplus (x, y) = x \bigoplus y, which is also perfectly clear!

[Figures: Wikipedia illustrations of injective and non-injective mappings, a function shown as arrows between two sets, and the 『function machine』 diagrams.]

If we look closely at y = f(x), say y = x^2, then what is the 『function f』, and what are the 『variables x, y』? From the definition of a function one sees that a 『variable』 is not some 『number that changes』; it is the notion of 『some number』 stipulated to lie in the 『domain』 or the 『codomain』, that is, the referential idea of 『every one』, 『there is one』, and 『there is exactly one』 among the 『elements of the sets』 in that definition. What could be difficult about that? Now imagine another function z = w^2 whose domain and codomain are the same as those of y = x^2: are these two functions the same or different? If we say they are the same function, then the 『function』 in question must be 『\Box^2』, where the 『variables』 y, z are merely 『names』 for the function's output number, and the 『variables』 w, x are 『dummy names』 for the function's input number. If one takes the view that the function f turns an 『input number』 into an 『output number』, then this 『input and output』 is 『intrinsic』 to f and has nothing to do with how the 『input and output』 happen to be 『named』! What is more, neither the 『domain』 nor the 『codomain』 needs to be a 『set of numbers』 at all; in that case what we call a 『函數』 (a function of numbers) is perhaps better called a 『函式』 (a general map), and a 『變數』 (a varying number) had better be called a 『變元』 (a variable element).

Next, suppose several functions are 『composed』 (composition), like chaining 『output into input』. Mathematics ordinarily writes, for example, g(f(x)) = x^2 + 1; but unless we also supply g(x) = x + 1 and f(x) = x^2, how could we know what the 『structure』 of this function is? Going further, can a 『function』 not be viewed as a computational 『operator』, defining what f + g, f - g, f * g, f / g mean? As when one defines it thus:
(f \otimes g) (x) \ =_{df} \  f(x) \otimes g(x)

and function 『composition』 is defined like so:
(f (g) ) (x) \ =_{df} \  f(g(x))

In this way, the domain or codomain of a 『function』 or of a 『binary operation』 can itself include 『functions』 among its objects, which is why it is called a 『functional』.
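As an aside (not part of the excerpt), the two defining equations translate almost verbatim into higher-order functions; the helper names `combine` and `compose` below are our own, a sketch rather than any standard library facility:

```python
import operator

def combine(op, f, g):
    """(f op g)(x) =df f(x) op g(x): lift a binary operation on numbers
    to a binary operation on functions."""
    return lambda x: op(f(x), g(x))

def compose(f, g):
    """(f(g))(x) =df f(g(x)): composition as plumbing one output into the next input."""
    return lambda x: f(g(x))

f = lambda x: x ** 2
g = lambda x: x + 1

f_plus_g = combine(operator.add, f, g)   # x^2 + x + 1
g_of_f   = compose(g, f)                 # x^2 + 1, the g(f(x)) of the text

print(f_plus_g(2), g_of_f(2))            # 7  5
```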

Moreover, extending the domain of a function from a single number or object to an 『ordered tuple』 is entirely natural, just as the 『temperature function』 of a room is T(x, y, z); yet it creates another problem of expression. Imagine f(x) = x^2 - y^2, g(y) = x^2 - y^2, and h(x, y) = x^2 - y^2: the two functions f and g are both 『partial』 functions of h, each obtained by holding one argument fixed, and they give rise to two different 『families of functions』. In the middle of a computation, then, what does the expression 『x^2 - y^2』 actually denote? Does it mean 『f』, or 『g』, or perhaps even 『h』? Surely 『the difference of two squares』 is not meaningless in itself?? For this reason, the 『λ notation』 developed by Alonzo Church seeks to 『state clearly』 just what an 『expression』 refers to. In this notation, f, g, h are written:

f \ =_{df} \ \lambda x. \ x^2 - y^2

g \ =_{df} \ \lambda y. \ x^2 - y^2

h \ =_{df} \ \lambda x. \lambda y. \ x^2 - y^2

Then the earlier g(f(x)) can also be written as:

\lambda z.  \ ( \lambda y.  \ y + 1) (( \lambda x. \ x^2)\  z)

(a matter said to be clear and plain, yet not necessarily clear and easy to read once written out)

─── Excerpted from 《λ 運算︰淵源介紹》
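As a closing illustration (ours, not part of the excerpt), Python's lambda mirrors the curried h and the composed term directly; the open term λx. x^2 - y^2 with y free has no closed Python counterpart, so only the closed terms are shown:

```python
# h = λx. λy. x^2 - y^2 : the curried two-argument "difference of squares"
h = lambda x: lambda y: x ** 2 - y ** 2

# fixing the first argument recovers a one-variable member of the family,
# playing the role the text assigns to f = λx. x^2 - y^2 with y still open
f_with_x_fixed = h(3)               # λy. 9 - y^2
print(h(3)(2), f_with_x_fixed(2))   # 5  5

# g(f(x)) = x^2 + 1 spelled out as the λ-term λz. (λy. y + 1)((λx. x^2) z)
g_of_f = lambda z: (lambda y: y + 1)((lambda x: x ** 2)(z))
print(g_of_f(4))                    # 17
```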