W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】三

2016-05-15 懸鉤子

乍看之下， Michael Nielsen 先生之此段文本︰

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ $\partial C / \partial w$ and $\partial C / \partial b$ $\partial C / \partial b$ of the cost function $C$ $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it’s useful to have an example cost function in mind. We’ll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form

C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2, \ \ \ \ (26)

where: $n$ $n$ is the total number of training examples; the sum is over individual training examples, $x$ ; $x$ $y=y(x)$ $y = y (x)$ is the corresponding desired output; $L$ $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ $a^{L} = a^{L} (x)$ is the vector of activations output from the network when $x$ $x$ is input.

Okay, so what assumptions do we need to make about our cost function, $C$ $C$ , in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ $C = \frac{1}{n} \sum_{x} C_{x}$ over cost functions $C_x$ $C_{x}$ for individual training examples, $x$ $x$ . This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \|y-a^L \|^2$ $C_{x} = \frac{1}{2} ∥ y - a^{L} ∥^{2}$ . This assumption will also hold true for all the other cost functions we’ll meet in this book.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ $\partial C_{x} / \partial w$ and $\partial C_x / \partial b$ $\partial C_{x} / \partial b$ for a single training example. We then recover $\partial C / \partial w$ $\partial C / \partial w$ and $\partial C / \partial b$ $\partial C / \partial b$ by averaging over training examples. In fact, with this assumption in mind, we’ll suppose the training example $x$ $x$ has been fixed, and drop the $x$ $x$ subscript, writing the cost $C_x$ $C_{x}$ as $C$ $C$ . We’ll eventually put the $x$ back in, but for now it’s a notational nuisance that is better left implicit.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example

x

may be written as

C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2, \ \ \ \ (27)

and thus is a function of the output activations. Of course, this cost function also depends on the desired output $y$ $y$ , and you may wonder why we’re not regarding the cost also as a function of $y$ $y$ . Remember, though, that the input training example $x$ $x$ is fixed, and so the output $y$ $y$ is also a fixed parameter. In particular, it’s not something we can modify by changing the weights and biases in any way, i.e., it’s not something which the neural network learns. And so it makes sense to regard $C$ $C$ as a function of the output activations $a^L$ $a^{L}$ alone, with $y$ $y$ merely a parameter that helps define that function.

───

簡單易明！其實梳理實在費事？首先我們將式子 (26) 改寫如下

$\begin{array}{rcl} (26) & C = \frac{1}{2 n} \sum_{x} ∥ y (x) - a^{L} (x) ∥^{2}, \end{array}$

。

$\begin{array}{rcl} (27) & C = \frac{1}{2} ∥ y - a^{L} ∥^{2} = \frac{1}{2} \sum_{j} (y_{j} - a_{j}^{L})^{2}, \end{array}$ $C_x \equiv \frac{1}{2} \| \vec y-{\vec a}^L\|^2$ 。那麼『函式』 $C$ 就是所有之 $n$ 個『訓練樣本』的『平均』『目標誤差』。若是仔細審查 $C_x = \frac{1}{2} \sum_i (y_i-a^L_i)^2$ 的定義，將可發現它就是兩向量之『歐式距離』之半，其實是個『純量』。而且所謂的 $\begin{array}{rcl} (27) & C = \frac{1}{2} ∥ y - a^{L} ∥^{2} = \frac{1}{2} \sum_{j} (y_{j} - a_{j}^{L})^{2}, \end{array}$

Parameter

A parameter (from the Ancient Greek παρά, “para”, meaning “beside, subsidiary” and μέτρον, “metron”, meaning “measure”), in its common meaning, is a characteristic, feature, or measurable factor that can help in defining a particular system. A parameter is an important element to consider in evaluation or comprehension of an event, project, or situation. Parameter has more specific interpretations in mathematics, logic, linguistics, environmental science, and other disciplines.^[1]

Mathematical functions

Mathematical functions have one or more arguments that are designated in the definition by variables. A function definition can also contain parameters, but unlike variables, parameters are not listed among the arguments that the function takes. When parameters are present, the definition actually defines a whole family of functions, one for every valid set of values of the parameters. For instance, one could define a general quadratic function by declaring

f(x)=ax^2+bx+c

;

here, the variable x designates the function’s argument, but a, b, and c are parameters that determine which particular quadratic function is being considered. A parameter could be incorporated into the function name to indicate its dependence on the parameter. For instance, one may define the base b of a logarithm by

\log_b(x)=\frac{\log(x)}{\log(b)}

where b is a parameter that indicates which logarithmic function is being used. It is not an argument of the function, and will, for instance, be a constant when considering the derivative $\textstyle\log_b'(x)$ .

In some informal situations it is a matter of convention (or historical accident) whether some or all of the symbols in a function definition are called parameters. However, changing the status of symbols between parameter and variable changes the function as a mathematical object. For instance, the notation for the falling factorial power

n^{\underline k}=n(n-1)(n-2)\cdots(n-k+1)

defines a polynomial function of n (when k is considered a parameter), but is not a polynomial function of k (when n is considered a parameter). Indeed, in the latter case, it is only defined for non-negative integer arguments. More formal presentations of such situations typically start out with a function of several variables (including all those that might sometimes be called “parameters”) such as

(n,k) \mapsto n^{\underline{k}}

as the most fundamental object being considered, then defining functions with fewer variables from the main one by means of currying.

───

依據『此樣本』與『彼樣本』而固定，故而無從改變。唯一能變的是『神經網絡』之『輸出』 ${\vec a}^L$ ，這恰是『學習演算法』之目的也！！！那為什麼要用『之半』 $\frac{1}{2}$ 係數的呢？因為 $C_x$ 是『二次型』，在求『導數』時將會多個『乘數因子』 $2$ ，如是將 $2 \cdot \frac{1}{2} = 1$ 的乎？？？當然『樣本數』 $n$ 也是『參數』，所以說 $C$ 只是 ${\vec a}^L$ 的函式矣。再由『激勵』之層層間的關係︰

${\vec a}^{\ l} \ = \ \sigma( W^{\ l} \ {\vec a}^{\ l-1} \ + \ {\vec b}^{\ l})$

，故知 $C$ 也可以表示成

$W^L, W^{L-1}, \cdots , W^l, \cdots , {\vec b}^L , {\vec b}^{\ L-1}, \cdots ,{\vec b}^l , \cdots$

之函式的了。

何不就趁此機會讀讀《λ 運算︰……》系列文本，了解一下什麼是『變元』？什麼是『函式』的耶︰

首先請讀者參考在《Thue 之改寫系統《一》》一文中的『符號定義』，於此我們引用該文中談到數學裡『函數定義』的一小段︰

如此當數學家說『函數』 $f$ 的定義時︰

假使有兩個集合 $S$ 和 $T$ ，將之稱作『定義域』domain 與『對應域』codomain，函數 $f$ 是 $S \times T$ 的子集，並且滿足

$\forall x \ x \in S \ \exists ! \ y \ y \in T \ \wedge \ (x,y) \in f$

，記作 $x \mapsto y = f (x)$ ，『 $\exists \ !$ 』是指『恰有一個』，就一點都不奇怪了吧。同樣『二元運算』假使『簡記』成 $X \times Y \mapsto_{\bigoplus} \ Z$ ， $X=Y=Z=S$ ，是講︰

$z = \bigoplus ( x, y) = x \bigoplus y$ ，也是很清晰明白的呀！！

如果仔細考察 $y = f(x)$ ── 比方說 $y = x^2$ ──，那麼『函數 $f$ 』是什麼呢？『變數 $x, y$ 』又是什麼呢？如果從函數定義可以知道『變數』並不是什麼『會變的數』，而是規定在『定義域』或者『對應域』中的『某數』的概念，也就是講在該定義的『集合元素中』談到『每一個』、『有一個』和『恰有一個』…的那樣之『指稱』觀念。這能有什麼困難的嗎？假使設想另一個函數 $z = w^2$ ，它的定義域與對應域都和函數 $y = x^2$ 一樣，那麼這兩個函數是一樣還是不一樣的呢？如果說它們是相同的函數，那麼這個所說的『函數』就該是『 $\Box^2$ 』，其中 $y, z$ 『變數』只是『命名的』── 函數的輸出之數 ──，而且 $w, x$ 『變數』是『虛名的』── 函數的輸入之數 ──。如果從函數 $f$ 將『輸入的數』轉換成『輸出的數』的觀點來看，這個『輸入與輸出』本就是 $f$ 所『固有的』，所以和『輸入與輸出』到底是怎麼『命名』無關的啊！更何況『定義域或對應域』任一也都不必是『數的集合』，這時所講的『函數』或許稱作『函式』比較好，『變數』或該叫做『變元』。其次假使將多個函數『合成』composition，好比『輸出入』的串接，舉例來講，一般數學上表達成 $g(f(x)) = x^2 + 1$ ，此時假使不補足， $g(x) = x + 1$ 和 $f(x) = x^2$ ，怕是不能知道這個函數的『結構』是什麼的吧？進一步講『函數』難道不能看成『計算操作子』operator 的概念，定義著什麼是 $f + g$ 、 $f - g$ 、 $f * g$ 或 $f / g$ 的嗎？就像將之這樣定義成︰
$(f \otimes g) (x) \ =_{df} \ f(x) \otimes g(x)$

，而將函數合成這麼定義為︰
$(f (g) ) (x) \ =_{df} \ f(g(x))$

。如此將使『函數』或者『二元運算』的定義域或對應域可以含括『函數』的物項，所以說它是『泛函式』functional 的了。

再者將函式的定義域由一數一物推廣到『有序元組』turple 也是很自然的事，就像講房間裡的『溫度函數』是 $T (x, y, z)$ 一樣，然而這也產生了另一種表達的問題。假想 $f(x) = x^2 - y^2$ 、 $g(y) = x^2 - y^2$ 和 $h(x, y) = x^2 - y^2$ ，這 $f, g$ 兩個函數都是 $h$ 函數的『部份』partial 函數，構成了兩個不同的『函數族』。於是在一個運算過程中，這個表達式『 $x^2 - y^2$ 』究竟是指什麼？是指『 $f$ 』還是指『 $g$ 』呢？也許說不定是指『 $h$ 』的呢？難道說『兩平方數之差』本身就沒有意義的嗎？？因是之故，邱奇所發展的『λ 記號法』是想要『清晰明白』的『表述』一個『表達式』所說之內容到底是指的什麼。如果使用這個記號法， $f, g, h$ 記作︰

$f \ =_{df} \ \lambda x. \ x^2 - y^2$

$g \ =_{df} \ \lambda y. \ x^2 - y^2$

$h \ =_{df} \ \lambda x. \lambda y. \ x^2 - y^2$

。那麼之前的 $g(f(x))$ 也可以寫成了︰

$\lambda z. \ ( \lambda y. \ y + 1) (( \lambda x. \ x^2)\ z)$ 。

── 說是清晰明白的事理，表達起來卻未必是清楚易懂 ──

─── 摘自《λ 運算︰淵源介紹》

樹莓派、樹莓派之學習、樹莓派之教育

W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】二

2016-05-14 懸鉤子

為什麼 Michael Nielsen 要講他用的『記號法』初看有點『怪異』的呢？？又為什麼看久之後反倒覺得『自然』的哩！！

The following diagram shows examples of these notations in use:

With these notations, the activation $a^l_j$

a_{j}^{l}

of the $j^{\rm th}$

j^{t h}

neuron in the $l^{\rm th}$

l^{t h}

layer is related to the activations in the

(l - 1)^{t h}

layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)

$a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \ \ \ \ (23)$

where the sum is over all neurons $k$ $k$ in the $(l-1)^{\rm th}$ $(l - 1)^{t h}$ layer. To rewrite this expression in a matrix form we define a weight matrix $w^l$ $w^{l}$ for each layer, $l$ $l$ . The entries of the weight matrix $w^l$ $w^{l}$ are just the weights connecting to the $l^{\rm th}$ $l^{t h}$ layer of neurons, that is, the entry in the $j^{\rm th}$ $j^{t h}$ row and $k^{\rm th}$ $k^{t h}$ column is $w^l_{jk}$ $w_{j k}^{l}$ . Similarly, for each layer $l$ we define a bias vector, $b^l$ $b^{l}$ . You can probably guess how this works – the components of the bias vector are just the values $b^l_j$ $b_{j}^{l}$ , one component for each neuron in the $l^{\rm th}$ $l^{t h}$ layer. And finally, we define an activation vector $a^l$ $a^{l}$ whose components are the activations $a^l_j$ $a_{j}^{l}$ .

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$ $σ$ . We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as $\sigma$ $σ$ to every element in a vector $v$ $v$ . We use the obvious notation $\sigma (v)$ $σ (v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma (v)$ $σ (v)$ are just ${\sigma (v)}_j = \sigma (v_j)$ $σ (v)_{j} = σ (v_{j})$ . As an example, if we have the function $f(x) = x^2$ $f (x) = x^{2}$ then the vectorized form of $f$ has the effect

f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right) = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right] = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right], \ \ \ \ (24)

that is, the vectorized $f$ just squares every element of the vector.

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form

a^{l} = \sigma(w^l a^{l-1}+b^l). \ \ \ \ (25)

This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $σ$ function*

*By the way, it’s this expression that motivates the quirk in the $w^l_{jk}$ $w_{j k}^{l}$ notation mentioned earlier. If we used $j$ $j$ to index the input neuron, and $k$ $k$ to index the output neuron, then we’d need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That’s a small change, but annoying, and we’d lose the easy simplicity of saying (and thinking) “apply the weight matrix to the activations”..

That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we’ve taken to now. Think of it as a way of escaping index hell, while remaining precise about what’s going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.

When using Equation (25) to compute $a^l$ $a^{l}$ , we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$ $z^{l} \equiv w^{l} a^{l - 1} + b^{l}$ along the way. This quantity turns out to be useful enough to be worth naming: we call $z^l$ $z^{l}$ the weighted input to the neurons in layer $l$ $l$ . We’ll make considerable use of the weighted input $z^l$ $z^{l}$ later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as $a^{l} = σ (z^{l})$ . It’s also worth noting that $z^l$ $z^{l}$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$ $z_{j}^{l} = \sum_{k} w_{j k}^{l} a_{k}^{l - 1} + b_{j}^{l}$ , that is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ $j$ in layer $l$ $l$ .

───

試想假使那個神經網絡『輸出入』之圖示不是『由左到右』的畫，而是『從上往下』展現︰

neural-top-down

人們會怎麼想像 $w^l_{jk}$ 的呢？若是一開始神經網絡『輸出入』就是『自右向左』為準則，難道不會更顯得自然的嗎？？由於通常『矩陣』之『記號法』，『輸入』寫在『右』邊，『輸出』寫在『左』邊，因此直覺上或有『矛盾』乎！以至於才覺得『怪異』的吧！！

因為『物理數學』或『工程數學』的教科書，一般會把『矩陣』 $M$ 用大寫字母表示，『向量』 $v$ 或加上『→』 $\vec v$ ，如是 $M_{jk}$ 就是該『矩陣』，第 $j$ 『列』 row 第 $k$ 『行』之『元素』；而 $v_i$ 就是『行向量』 $\vec v$ 的第 $i$ 個『元素』的了。

如果依此將 (25) 式子改寫為︰

${\vec a}^{\ l} \ = \ \sigma( W^{\ l} \ {\vec a}^{\ l-1} \ + \ {\vec b}^{\ l})$

是否更能清楚表達第 $l$ 層之『激活』 activation ${\vec a}^{\ l}$ 與前一層 $l-1$ 之『激活』 ${\vec a}^{\ l-1}$ 的關係呢？？！！如此可將所謂的『向量函式』 $\vec y \ = \ f ( \vec x)$ 理解為 $y_i \ = \ f(x_i)$ 也將順理成章耶！！？？

切莫輕忽『記號法』以及『再表現』之重要性，往往它們正是深化理解的門徑，希望短短一段文本足以引人入勝也︰

⊙	O	I	A	B	⊕	O	I	A	B
O	O	O	O	O	O	O	I	A	B
I	O	I	A	B	I	I	O	B	A
A	O	A	B	I	A	A	B	O	I
B	O	B	I	A	B	B	A	I	O

那麼要如何『了解』上面那個『體』的『加法表』與『乘法表』呢？通常人們會自然的把 $\bigoplus$ 看成『加』，將 $\bigodot$ 想為『乘』。然而『數學』的一般『抽象結構』是由『規則』所『定義』的，很多講的是某個『集合』內之『元素』所具有的『性質』，以及『運算』所滿足的『定律』。這與有沒有人們所『熟悉的』類似結構無關，而且那些『元素』也未必得是個『數』的啊！這或許就是『抽象數學』之所以『困難』的原因。雖然從『純粹』的『邏輯推理』能夠得到『結論』，只不過要是缺乏『經驗性』，人們通常『感覺』不實在、不具體、而且也不安心。就讓我們試著給這個『體』一的比較容易『理解』之結構的『再現』︰設想將 $O, I, A, B$ 表現在『複數平面』上，其中 $O$ 是『原點』，而 $I, A, B$ 位在『單位圓』之上，定義如下
$O \equiv_{rp} \ 0 + i 0 = 0$
$I \equiv_{rp} \ 1 + i 0 = 1$
$A \equiv_{rp} \ - \frac{1}{2} + i \frac{\sqrt{3}}{2}$
$B \equiv_{rp} \ - \frac{1}{2} - i \frac{\sqrt{3}}{2}$

$X \bigoplus Y \equiv_{rp} \ -(X + Y), if X \neq Y$
$X \bigoplus Y \equiv_{rp} \ (X - Y) = 0, if X = Y$
$X \bigodot Y \equiv_{rp} \ X \cdot Y$

$\because I \bigoplus A \equiv_{rp} \ - \left[ 1 + \left( - \frac{1}{2} + i \frac{\sqrt{3}}{2} \right) \right]$
$= - \left( \frac{1}{2} + i \frac{\sqrt{3}}{2} \right) = B$

$A \bigodot A \equiv_{rp} \ {\left( - \frac{1}{2} + i \frac{\sqrt{3}}{2} \right)}^2$
$= \frac{1}{4} - i \frac{\sqrt{3}}{2} - \frac{3}{4} = B$

$\therefore (I \bigoplus A) \bigoplus B = B \bigoplus B$
$= I \bigoplus A \bigoplus (A \bigodot A) = O$

如果將它用複數改寫成 $1 + A + A^2 = 0$ ，這不就是『上上篇』裡的『三次方程式』 $x^3 - 1 = (x - 1)(x^2 + x + 1) = 0$ 的『解』 $\omega$ 的嗎？再徵之以『相量』的『向量加法』和『旋轉乘法』，這個四個元素的『體』之『喻義』也許可以『想像』的了。假使人們對於『抽象思考』一再重複的『練習』，那麼『邏輯推演』也將會是『經驗』中的了！就像俗語說的︰熟能生巧；『抽象的』也就成了『直覺』上的了！！

─── 摘自《【Sonic π】電路學之補充《四》無窮小算術‧下上》

樹莓派、樹莓派之學習、樹莓派之教育

W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】一

2016-05-13 懸鉤子

談及『反向傳播算法』之前， Michael Nielsen 先生首起『記號法』之用︰

……

Let’s begin with a notation which lets us refer to weights in the network in an unambiguous way. We’ll use $w^l_{jk}$ $w_{j k}^{l}$ to denote the weight for the connection from the $k^{\rm th}$ $k^{t h}$ neuron in the $(l-1)^{\rm th}$ $(l - 1)^{t h}$ layer to the $j^{\rm th}$ $j^{t h}$ neuron in the $l^{\rm th}$ $l^{t h}$ layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

This notation is cumbersome at first, and it does take some work to master. But with a little effort you’ll find the notation becomes easy and natural. One quirk of the notation is the ordering of the

j

and $k$

k

indices. You might think that it makes more sense to use $j$

j

to refer to the input neuron, and

k

to the output neuron, not vice versa, as is actually done. I’ll explain the reason for this quirk below.

We use a similar notation for the network’s biases and activations. Explicitly, we use $b^l_j$ $b_{j}^{l}$ for the bias of the $j^{\rm th}$ $j^{t h}$ neuron in the $l^{t h}$ layer. And we use $a^l_j$ $a_{j}^{l}$ for the activation of the $j^{\rm th}$ $j^{t h}$ neuron in the $l^{\rm th}$ $l^{t h}$ layer. The following diagram shows examples of these notations in use:

With these notations, the activation $a^l_j$

a_{j}^{l}

of the $j^{\rm th}$

j^{t h}

neuron in the $l^{\rm th}$

l^{t h}

layer is related to the activations in the

(l - 1)^{t h}

layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)

$a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \ \ \ \ (23)$

───

果有深意哉？？如果設想使用『羅馬數字』作算術

$XIV \times LXX$

，雖是大學者，恐怕一樣頭疼。假使記作

$14 \times 70$

，縱使小朋友，也能駕馭自如！！

恰似中國古代的『天元術』之所以難學難傳的乎！！？？

如果考之以歷史，或許最早的『語詞算術』，可能源於中國古代的『天元術』︰

在中國數學史上最早創立天元概念的是北宋平陽蔣周所著的《益古集》，隨後有博陸李文一撰《照膽》，鹿泉石信道撰《鈐經》，平水劉汝諧撰《如積釋鎖》，處州李思聰《洞淵九容》後人才知道有天元。

李冶在東平獲得劉汝諧撰《如積釋鎖》，書中用十九個單字表示未知數的各個 $x^9$ 至 $x^-9$ 的冪：

仙、明、霄、漢、壘、層、高、上、天、人、地、下、低、減、落、逝、泉、暗、鬼；其中立天元在上。

後來有太原彭澤彥出，反其道而行，以天元在下^[2]。

《益古集》，《照膽》，《鈐經》，《如積釋鎖》，《洞淵九容》等早期天元術著作今已失傳。李冶在《測圓海鏡》中使用天元在上的天元術。後來李冶又著《益古演段》，採用天元在下的次序。朱世傑《四元玉鑒》和《算學啟蒙》卷下也採用天元在下的次序。

在天元術中，一次項係數旁記一「元」字（或在常數項旁記一「太」字）。

歷史上有兩種次序：

《測圓海鏡》式

「元」以上的係數表示各正次冪，「元」以下的係數表示常數項和各負次冪）。

例：李冶《測圓海鏡》第二卷第十四問方程： $-x^2-680x+96000=0$

元

《益古演段》式

「元」以下的係數表示各正次冪，「元」以上的係數表示常數和各負次冪

例一：

李冶《益古演段》卷中第三十六問中的方程= $3x^2+210x-20325$ 用天元術表示為:

太

元（x）

（

x^2

項）

其中「太」是常數項，算籌打斜線表示該項常數為負數。「元」相當於未知數x

對於東方早期古典『高階方程式論』有興趣之讀者，也許可以讀讀金代數學家李冶所著『測圓海鏡』，體會一下不同的『思路』。或將可以感受『符號學』的發展對於『數理邏輯』的貢獻，了解有時理解的難易就藏在『記號法』之中，畢竟『符號』也能有『美學』的吧！！

─ 摘自《勇闖新世界︰《 pyDatalog 》【專題】之約束編程‧一》

故而對此『記號法』務先嫻熟善用的耶！！

樹莓派、樹莓派之學習、樹莓派之教育

W!o+ 的《小伶鼬工坊演義》︰神經網絡【FFT】八

2016-05-12 懸鉤子

自從愛因斯坦用『光子』 ── 頻率為 $\nu$ 之光量子，攜帶 $h \nu$ 的能量 ── 來解釋『光電效應』，就開啟了『光』之『波‧粒二重性』。其後德布羅意設想

物質波

物理學中，物質波（即德布羅意波）係指所有物質的波（見波粒二象性）。

德布羅意說明了波長和動量成反比；頻率和總能成正比之關係，是路易·德布羅意於1923年在他的博士論文提出的。

第一德布羅意方程式指出，粒子波長λ（亦稱德布羅意波長）和動量p的關係：（下式中普朗克常數h、粒子靜質量m、粒子速度v、勞侖茲因子γ和真空光速c）

\lambda = \frac{h}{p} = \frac{h}{\gamma mv} = \frac{h}{mv} \sqrt{1 - \frac{v^2}{c^2} }

第二德布羅意方程式指出頻率ν和總能E的關係：

\nu = \frac{E}{h} = \frac{\gamma\,mc^2}{h} = \frac {1}{\sqrt{1 - \frac{v^2}{c^2}}} \cdot \frac{mc^2}{h}

這兩個式子通常寫作

p = {h \over \lambda} = \hbar {2\pi \over \lambda} = \hbar k \,

E = h \nu = \hbar \cdot 2\pi \nu = \hbar \omega \,

平面波

波包

德布羅意波的1維傳播，複值波幅的實部以藍色表示、虛部以綠色表示。在某位置找到粒子的機率（以顏色的不透明度表示）呈波形狀延展。

───

說明『物質』具有『波動性』也自然順理成章。雖說

波包

在任意時刻，波包（wave packet）是局限在空間的某有限範圍區域內的波動，在其它區域的部分非常微小，可以被忽略。波包整體隨著時間流易移動於空間。波包可以分解為一組不同頻率、波數、相位、波幅的正弦波，也可以從同樣一組正弦波構成；在任意時刻，這些正弦波只會在空間的某有限範圍區域相長干涉，在其它區域會破壞性干涉。^[1]^:53-56^[2]^:312-313描繪波包輪廓的曲線稱為包絡線。依據不同的演化方程式，在傳播的時候，波包的包絡線（描繪波包輪廓的曲線）可能會保持不變（沒有色散），或者包絡線會改變（有色散）。

在量子力學裏，波包可以用來代表粒子，表示粒子的機率波；也就是說，表現於位置空間，波包在某時間、位置的波幅平方，就是找到粒子在那時間、位置的機率密度；在任意區域內，波包所囊括面積的絕對值平方，就是找到粒子處於那區域的機率。粒子的波包越狹窄，則粒子位置的不確定性越小，而動量的不確定性越大；反之亦然。這位置的不確定性和動量的不確定性，兩者之間無可避免的關係，是不確定性原理的一個標準案例。^[1]^:53-56

描述粒子的波包滿足薛丁格方程式，是薛丁格方程式的數學解。通過含時薛丁格方程式，可以預測粒子隨著時間演化的量子行為。這與在經典力學裏的哈密頓表述很類似。^[3]^:123

實線是波包，虛線是波包的包絡。當波包傳播於空間時，包絡以群速度移動。

───

的物理性質早為人所熟知。但講起『小波分析』卻是新鮮之事︰

小波分析

小波分析（wavelet analysis）或小波轉換（wavelet transform）是指用有限長或快速衰減的、稱為母小波（mother wavelet）的振盪波形來表示訊號。該波形被縮放和平移以匹配輸入的訊號。

小波一詞由Morlet和Grossman在1980年代早期提出。他們用的是法語詞ondelette – 意思就是”小波”。後來在英語裡，”onde”被改為”wave”而成了wavelet。

小波變換分成兩個大類：離散小波變換（DWT）和連續小波轉換（CWT）。兩者的主要區別在於，連續變換在所有可能的縮放和平移上操作，而離散變換採用所有縮放和平移值的特定子集。

小波理論和幾個其他課題相關。所有小波變換可以視為時域頻域表示的形式，所以和調和分析相關。所有實際有用的離散小波變換使用包含有限脈衝響應濾波器的濾波器段（filter band）。構成CWT的小波受海森堡的測不準原理制約，或者說，離散小波基可以在測不準原理的其他形式的情境中考慮。

母小波

簡單來說（技術上並非如此），母小波函數 $\psi\ (t)$ 必須滿足下列條件：

\int_{-\infty}^{\infty} |\psi (t)|\ ^2\, dt = 1

,也即

\psi\in L^2(\R)

並單位化

\int_{-\infty}^{\infty} |\psi\ (t)|\, dt <\infty

,也即

\psi\in L^1(\R)

\int_{-\infty}^{\infty} \psi\ (t)\, dt = 0

多數情況下，需要要求 $\psi$ 連續且有一個矩(moment)為0的大整數M，也即對所有整數m<M滿足

\int_{-\infty}^{\infty} t^m\,\psi\ (t)\, dt = 0

即母小波有M個消失矩(vanishing moment)，且M不等於0，這表示母小波必須不是常數且均值為0。

\int_{-\infty}^{\infty} \frac{|\overset{\frown}{\psi} (\omega)|^{2}}{|\omega|} d\omega <\infty\ \

，稱作可採納性(admissibility condition)，其中

\overset{\frown}{\psi}(\omega)

是

\psi(t)

的傅立葉轉換。

技術上來講，母小波必須滿足可採納性條件以使某個解析度的恆等成立。

根據Morlet的原始形式，母小波定義為

\psi _{a,b} (t) = {1 \over {\sqrt a }}\psi \left( {{{t - b} \over a}} \right)

其中a是縮放因子，當|a|<1時，母小波被壓縮，在時間軸上有較小的支撐度，並且對應到高頻，因為母小波變窄、變化變快，反之，當|a|>1時，母小波變寬、變化較慢，所以對應到低頻。b則是平移參數，用來決定母小波的位置。

根據訊號處理的測不準原理:

\Delta t \Delta \omega \geqq \frac{1}{2}

t 是時間，ω是角頻率(ω = 2πf，f是順時頻率)。

當時間解析度較高時，頻率解析度就會下降，反之，頻率解析度高時，時間解析度下降。當母小波或窗函數取的越寬， $\Delta t$ 的值越大。

當 $\Delta t$ 越大:

1. 縮放因子越大，對應低頻。

2. 頻率解析度高。

3. 時間解析度低。

當 $\Delta t$ 越小:

1. 縮放因子越小，對應高頻。

2. 頻率解析度低。

3. 時間解析度高。

雖然跟短時距傅立葉轉換一樣能同時分析時間和頻率，但是小波轉換在高頻的時間解析度較好，在低頻時則是頻率解析度較好，剛好符合我們對訊號分析在高低頻的解析度要求，因為在低頻時，例如頻率從1Hz變到2Hz，頻率差了一倍，所以頻率的變化相較時間的變化是比較明顯且重要的，然而在高頻時，例如頻率從1000Hz變到1001Hz，頻率相較時間的變化不大，所以對時間解析度的要求較高。但是短時距傅立葉轉換的解析度並不會隨著頻率而變化，下圖顯示兩者解析度變化的比較:

母小波的一些例子：

Wavelet_-_Meyer

Meyer

250px-Wavelet_-_Morlet

Morlet

Wavelet_-_Mex_Hat

墨西哥帽

……

和傅里葉變換比較

小波變換經常和傅里葉變換做比較，在後者中訊號用正弦函數的和來表示。

轉換形式	數學式	參數
傅立葉變換	$X(f) = \int_{-\infty}^{\infty} x(t)e^{-j2 \pi ft}\, dt$	f, 頻率
短時距傅立葉變換	$X(t, f) = \int_{-\infty}^{\infty} w(t-\tau)x(\tau) e^{-j 2 \pi f \tau} \, d\tau$	t, 時間; f, 頻率
小波轉換	$X(a,b) = \frac{1}{\sqrt{b}}\int_{-\infty}^{\infty}\ x(t){\Psi\left(\frac{t - a}{b}\right)} \, dt$	a, 時間; b, 尺度

標準的傅立葉變換將訊號從時域轉換到頻率域上做分析，但沒辦法從頻率域上得知訊號在不同時間的頻率資訊，只能知道該訊號包含哪些頻率成份，因此不適合用來分析一個頻率會隨著時間而改變的訊號，例如:音樂訊號。

然而短時距傅立葉變換（Short-time Fourier transform）（STFT）比傅立葉變換多了一個窗函數(window function)，可以分析出隨著時間變化的頻率,隨著窗函數大小的不同會有不同的頻率和時間解析度，以方形窗函數為例，當窗函數寬度越大，頻率的解析度越好，但時間解析度下降，反之，當窗函數寬度越小，時間的解析度越好，頻率解析度下降，然而有限長度的窗函數大小會限制頻率解析度，不過小波轉換能解決這個問題，通過多解析度分析通常可以給出更好的訊號表示。

另外，當輸入訊號為二維時(例如:影像)，短時距傅立葉變換的輸出為四維度，但小波轉換仍是二維訊號，所以在影像處理上通常會使用小波轉換而非短時距傅立葉變換。

小波變換的計算複雜度也更小，只需要 $\mathcal{O}(N)$ 時間，快於快速傅里葉變換的 $\mathcal{O}(N \log N)$ ，其中 $N$ 代表數據大小。

Fourier_transform

傅立葉轉換

Gabor_transform

STFT

可否藉此理解人耳之奧妙以及神奇的耶！！？？

音樂訊號分析

小波轉換亦可用在音樂訊號上，像是樂器自動辨識的應用，第一種為先使用一維小波轉換將聲音訊號分解為不同頻率範圍的各個頻帶，接著再對各個頻帶中擷取能量平均值以及能量標準差視為一維小波轉換之特徵向量。而第二種方法為先將聲音訊號轉成頻譜圖並視為一張二維影像，對此頻譜圖做二維小波轉換分解出各個頻帶，再對頻帶中擷取能量平均值和能量標準差做為二維小波的特徵向量。最後，利用相鄰近似法使用歐基里德距離來計算測試資料的特徵向量和每一樂器的特徵向量之距離，並取最小距離為辨識結果的樂器類別^[3]。

而小波轉換也常用在音樂訊號的壓縮，由於人耳對聲音各頻帶是有其感知力的，故有些頻帶人無法聽見，有些頻帶人耳特別靈敏。利用離散小波轉換來將音樂訊號做高低頻切割多次，就可以將原訊號分成許多子頻帶(sub-band)，但傳統離散小波轉換計算架構，將波型分成高頻與低頻後，下一次的切割只對低頻做切割，故沒辦法完全分割出與人耳感知頻帶相符合的子頻帶。於是更精細的計算架構被提出，稱為離散小波包轉換(discrete wavelet packet transform)，原理就是音樂訊號被分成高頻訊號後，會再做分割。一段音樂訊號就可以被分割成更貼近人耳25個頻帶的訊號，這樣的分割法更優於一般傅立葉分析所使用的濾波器，從這些子頻帶中，找出能夠被屏蔽的訊號，濾除之後，就可以將原本音樂訊號檔案大小壓縮了。

在辨識音樂訊號的樂譜上也有其應用，音樂訊號由一個個音符組成，而每個音符以特定的節奏出現，通常是成群的諧音出現，若要分辨出一段訊號最主要的頻率為何，必須濾除其泛音才能判斷，而由離散小波轉換的多重解析度分割就可以將泛音區隔在不同的子頻帶中，而且訊號中的雜訊也可以依同樣方法被濾除。由於是要偵測transient 現象，基於要偵測什麼樣的訊號就使用跟它很像的訊號當作基底拆解它這個原則，故在選擇小波基底時，就要選擇較有突然劇烈變化的母小波，如此一來小波轉換後的小波係數，能量就會聚集在原訊號有劇烈變化之處了^[4]，由此方法可有效辨識音樂訊號的音高(也就是頻率)。

───

樹莓派、樹莓派之學習、樹莓派之教育

W!o+ 的《小伶鼬工坊演義》︰神經網絡【FFT】七

2016-05-11 懸鉤子

如果了解了『頻譜圖』

Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable. Spectrograms are sometimes called spectral waterfalls, voiceprints, or voicegrams.

Spectrograms can be used to identify spoken words phonetically, and to analyse the various calls of animals. They are used extensively in the development of the fields of music, sonar, radar, and speech processing,^[1] seismology, etc.

The instrument that generates a spectrogram is called a spectrograph.

The sample outputs on the right show a select block of frequencies going up the vertical axis, and time on the horizontal axis.

Spectrogram-19thC

Typical spectrogram of the spoken words “nineteenth century”. The lower frequencies are more dense because it is a male voice. The legend to the right shows that the color intensity increases with the density.

1280px-Spectrogram_of_violin

Spectrogram of

1. the actual recording of this violin playing

. Note the harmonics occurring at whole-number multiples of the fundamental frequency. Note the fourteen draws of the bow, and the visual differences in the tones.

Spectrogram

3D surface spectrogram of a part from a music piece.

Format

A common format is a graph with two geometric dimensions: the horizontal axis represents time or rpm, the vertical axis is frequency; a third dimension indicating the amplitude of a particular frequency at a particular time is represented by the intensity or color of each point in the image.

There are many variations of format: sometimes the vertical and horizontal axes are switched, so time runs up and down; sometimes the amplitude is represented as the height of a 3D surface instead of color or intensity. The frequency and amplitude axes can be either linear or logarithmic, depending on what the graph is being used for. Audio would usually be represented with a logarithmic amplitude axis (probably in decibels, or dB), and frequency would be linear to emphasize harmonic relationships, or logarithmic to emphasize musical, tonal relationships.

Generation

Spectrograms are usually created in one of two ways: approximated as a filterbank that results from a series of bandpass filters (this was the only way before the advent of modern digital signal processing), or calculated from the time signal using the FFT. These two methods actually form two different Time-Frequency Distributions, but are equivalent under some conditions.

Spectrogram and waterfall of a 8MHz wide PAL-I Television signal.

The bandpass filters method usually uses analog processing to divide the input signal into frequency bands; the magnitude of each filter’s output controls a transducer that records the spectrogram as an image on paper.^[2]

Creating a spectrogram using the FFT is a digital process. Digitally sampled data, in the time domain, is broken up into chunks, which usually overlap, and Fourier transformed to calculate the magnitude of the frequency spectrum for each chunk. Each chunk then corresponds to a vertical line in the image; a measurement of magnitude versus frequency for a specific moment in time. The spectrums or time plots are then “laid side by side” to form the image or a three-dimensional surface,^[3] or slightly overlapped in various ways, windowing.

The spectrogram of a signal s(t) can be estimated by computing the squared magnitude of the STFT of the signal s(t), as follows:^[4]

\mathrm{spectrogram}(t,\omega)=\left|\mathrm{STFT}(t,\omega)\right|^2

───

也許可以嘗試讀讀應用範例︰

Speech Recognition with BVLC caffe

Speech Recognition with the caffe deep learning framework

UPDATE: We are migrating to tensorflow

This project is quite fresh and only the first of three milestones is accomplished: Even now it might be useful if you just want to train a handful of commands/options (1,2,3..yes/no/cancel/…)

1) training spoken numbers:

get spectogram training images from http://pannous.net/spoken_numbers.tar (470 MB)
start ./train.sh
test with ipython notebook test-speech-recognition.ipynb or caffe test ... or <caffe-root>/python/classify.py
99% accuracy, nice!
online recognition and learning with ./recognition-server.py and ./record.py scripts

Sample spectrogram, Karen uttering ‘zero’ with 160 words per minute.

2) training words:

4GB of training data *
net topology: work in progress …
todo: use upcoming new caffe LSTM layers etc
UPDATE LSTMs get rolling, still not merged
UPDATE since the caffe project leaders have a hindering merging policy and this pull request was shifted many times without ever being merged, we are migrating to tensorflow
todo: add extra categories for a) silence b) common noises like typing, achoo c) ALL other noises

3) training speech:

todo!
100GB of training data here: http://www.openslr.org/12/
TIMIT dataset ＄27,000.00 membership fee or ＄250 for non-members+＄2400 under research-only license?
combine with google n-grams

───

認識一下『Caffe』是何物︰

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.

Check out our web image classification demo!

Why Caffe?

Expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine then deploy to commodity clusters or mobile devices.

Extensible code fosters active development. In Caffe’s first year, it has been forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors the framework tracks the state-of-the-art in both code and models.

Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference and 4 ms/image for learning. We believe that Caffe is the fastest convnet implementation available.

Community: Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Join our community of brewers on the caffe-users group and Github.

* With the ILSVRC2012-winning SuperVision model and caching IO. Consult performance details.

Documentation

DIY Deep Learning for Vision with Caffe
Tutorial presentation.
Tutorial Documentation
Practical guide and framework reference.
arXiv / ACM MM ‘14 paper
A 4-page report for the ACM Multimedia Open Source competition (arXiv:1408.5093v1).
Installation instructions
Tested on Ubuntu, Red Hat, OS X.
Model Zoo
BVLC suggests a standard distribution format for Caffe models, and provides trained models.
Developing & Contributing
Guidelines for development and contributing to Caffe.
API Documentation
Developer documentation automagically generated from code comments.

───

FreeSandal

W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】三

The two assumptions we need about the cost function

Parameter

Mathematical functions

W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】二

W!o+ 的《小伶鼬工坊演義》︰神經網絡【backpropagation】一

CHAPTER 2

How the backpropagation algorithm works

W!o+ 的《小伶鼬工坊演義》︰神經網絡【FFT】八

物質波

波包

小波分析

母小波

和傅里葉變換比較

音樂訊號分析

W!o+ 的《小伶鼬工坊演義》︰神經網絡【FFT】七

Spectrogram

Format

Generation

Speech Recognition with BVLC caffe

Caffe

Why Caffe?

Documentation

輕。鬆。學。部落客

2024 年 4 月
日	一	二	三	四	五	六
« 4 月
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30