W!o+'s 《小伶鼬工坊演義》: Neural Networks 【backpropagation】 Six

As the saying goes: the thousand-li dragon vein comes to rest and forms its node right here. This passage by Mr. Michael Nielsen

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:

  1. Input x : Set the corresponding activation a^1 for the input layer.
  2. Feedforward: For each l = 2, 3, \ldots, L compute z^{l} = w^l a^{l-1}+b^l and a^{l} = \sigma(z^{l}).
  3. Output error \delta^L : Compute the vector \delta^{L} = \nabla_a C \odot \sigma'(z^L).
  4. Backpropagate the error: For each l = L-1, L-2, \ldots, 2 compute \delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l}).
  5. Output: The gradient of the cost function is given by \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j and \frac{\partial C}{\partial b^l_j} = \delta^l_j.

Examining the algorithm you can see why it’s called backpropagation. We compute the error vectors \delta^l backward, starting from the final layer. It may seem peculiar that we’re going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
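To see where the backward step 4 comes from, it helps to write out the one application of the chain rule that the paragraph above alludes to (an aside; this merely restates the proof sketched in the surrounding chapter). With \delta^l_j \equiv \partial C / \partial z^l_j and z^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k, one backward step reads

\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \, w^{l+1}_{kj} \, \sigma'(z^l_j),

which, gathered into vector form, is exactly \delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l}).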

……

As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, C = C_x. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples
  2. For each training example x : Set the corresponding input activation a^{x,1}, and perform the following steps:
    • Feedforward: For each l = 2, 3, \ldots, L compute z^{x,l} = w^l a^{x,l-1}+b^l and a^{x,l} = \sigma(z^{x,l}).
    • Output error \delta^{x,L} : Compute the vector \delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L}).
    • Backpropagate the error: For each l = L-1, L-2, \ldots, 2 compute \delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l}).
  3. Gradient descent: For each l = L, L-1, \ldots, 2 update the weights according to the rule w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T, and the biases according to the rule b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}.

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I’ve omitted those for simplicity.
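Those omitted loops amount to only a few lines. Here is a minimal sketch of them (my own addition, not Nielsen's listing), assuming a Network instance net whose update_mini_batch method appears in the next section, and training_data given as a list of (x, y) tuples; epochs, mini_batch_size and eta are the knobs of the run:

import random

def sgd_outer_loops(net, training_data, epochs, mini_batch_size, eta):
    """The outer loops omitted above: shuffle the data each epoch,
    slice it into mini-batches, and apply the per-batch update."""
    n = len(training_data)
    for epoch in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k+mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            net.update_mini_batch(mini_batch, eta)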

───

 

may well read with ease and delight! Even the code

The code for backpropagation

Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

 

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses the backprop method to figure out the partial derivatives \partial C_x / \partial b^l_j and \partial C_x / \partial w^l_{jk}. The backprop method follows the algorithm in the last section closely. There is one small change – we use a slightly different approach to indexing the layers. This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., l[-3] is the third last entry in a list l. The code for backprop is below, together with a few helper functions, which are used to compute the \sigma function, the derivative \sigma', and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way. If something’s tripping you up, you may find it helpful to consult the original description (and complete listing) of the code.
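As a quick aside before the listing (not part of the original code), here is the negative-index convention in action:

# Negative indices count from the end: -1 is the last entry,
# -2 the second-last, and so on.
layers = ["input", "hidden 1", "hidden 2", "output"]
print(layers[-1])   # "output"   -- the last entry
print(layers[-3])   # "hidden 1" -- the third-last entry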

class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
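Since everything above leans on these derivatives being right, a cheap sanity check is worth having. The snippet below is my own addition (not part of the book's listing) and uses the sigmoid and sigmoid_prime helpers just defined; the same centered-difference trick, applied weight by weight to the cost, is the standard way to spot-check a full backprop implementation:

import numpy as np

def check_sigmoid_prime(z, eps=1e-6):
    """Compare sigmoid_prime(z) with the centered finite difference
    (sigmoid(z+eps) - sigmoid(z-eps)) / (2*eps)."""
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    return np.max(np.abs(numeric - sigmoid_prime(z)))

print(check_sigmoid_prime(np.linspace(-5.0, 5.0, 11)))  # tiny, on the order of 1e-10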

───

 

likewise seems clear at a glance!! At this very juncture, though, one must mind 'where the vein runs off to', the other half of the story. As was said earlier in the text of

《W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】六》

this 'stochastic gradient descent' method rests on 'underlying assumptions': the 'learning rate' \eta and the 'mini-batch size' m are both 'hyperparameters', and how to choose them appropriately still calls for careful thought and clear discernment??

At this moment, reading a passage that comes 'from out of the blue' □○ may well prove beneficial??!!

In the article 《λ 運算︰淵源介紹》 we noted that the 'variable' x in the 'function' y = f(x) = x^2 is 'nominal': it is an 'input value' drawn from the function's 'domain', while y is the 'output value' in its 'codomain'. The 'function' itself is fixed by the 'computation rule' \Box^2, so calling it f(x) or g(w), \ z = g(w) = w^2 refers to the same 'function'; in its 'abstract definition' a 'function' is thus 'anonymous'. 'Naming' is merely a convenience for 'referring to' or 'distinguishing' different 'functions'! When an 'element' h in the 'domain' of a 'function' F is itself a 'function', 'confusion' easily arises: do we mean a 'composite function' H = F \circ h, \ H(x) = F(h(x)) whose 'domain' is the 'variable' x, or a 'functional' whose 'domain' is the 'function' h(x) itself? For instance H(t) = \int_{0}^{t} h(x) dx. Perhaps it is best to give this latter kind of 'functional' its own distinct name. Then, when an equation containing an 'unknown function' or an 'implicit function' inside a 'composite function' is called a 'functional equation' (Functional equation), the 'distinction' can be drawn clearly!!
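To make the distinction concrete in code (a small illustration of my own, in Python since that is the language used above): a composite function still consumes a number, while a functional consumes a whole function:

def F(u):
    return u + 1

def h(x):
    return x * x

def composite(x):
    """H = F o h : the domain is still the variable x."""
    return F(h(x))

def functional(g, t, n=1000):
    """H(t) = integral of g from 0 to t, approximated by a left
    Riemann sum: the 'input' here is the function g itself."""
    dx = t / float(n)
    return sum(g(i * dx) for i in range(n)) * dx

print(composite(3))        # F(h(3)) = 10
print(functional(h, 2.0))  # about 8/3, the integral of x^2 over [0, 2]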

So, what 'importance' does a 'functional equation' have? Let us begin the discussion with an 'example':

f(x + y) = f(x) \cdot f(y)

That is, some 'unknown function' f transforms the 'addition of two numbers' x + y in the 'domain' into the 'multiplication of two numbers' f(x) \cdot f(y) in the 'codomain'. The 'global behavior' of this 'unknown function' is determined entirely by the 'equation'. For instance, this function satisfies f(0) = 1. Why? Because f(0) = f(0 + 0) = f(0) \cdot f(0) = {f(0)}^2, so f(0) = 0 or f(0) = 1. And why not take f(0) = 0? Because if we did, f(x) = f(x + 0) = f(x) \cdot f(0) = 0 would be nothing but the 'zero function', usually called the 'trivial solution' (trivial solution); for 'completeness' it cannot 'be omitted', yet it is 'too obvious' (0 + 0 = 0 \cdot 0) to seem to need 'further comment'. Still, this tradition of 'argument' sometimes causes 'beginners' 'difficulty' in 'understanding', so we point it out here expressly. How, then, does one 'solve' a functional equation? In general it is 'very difficult'; at times one may not even know whether it has 'one solution', 'several solutions', 'infinitely many solutions', or 'no solution' at all!! In recent years some 'standard methods' have gradually emerged, for example the 'successive approximation' by 'divide and conquer' used in 'dynamic programming' (Dynamic programming), or 'transforming' the original equation into a form suited to 'fixed-point iteration', and so on. Here we shall introduce an 'application' of 'infinitesimal analysis' and discuss how one might 'think about' such an equation for a 'smooth function' f(x). Since f(x) is 'smooth', it is 'differentiable': its 'derivative' f^{\prime}(x) exists and is 'continuous', so f(x + \delta x) can be written as

f(x + \delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x

, where \delta x \approx 0 \Longrightarrow \epsilon \approx 0. Since f(x + y) = f(x) \cdot f(y), we have f(x + \delta x) = f(x) \cdot f(\delta x), and therefore f(x + \delta x) = f(x) \cdot f(\delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x, so that

\frac{f^{\prime}(x)}{f(x)} = \frac{f(\delta x) - 1}{\delta x} - \epsilon

. Because f(x) is 'smooth', \frac{f^{\prime}(x)}{f(x)} exists and is continuous; yet \frac{f(\delta x) - 1}{\delta x} - \epsilon does not depend on the 'variable' x, so as \delta x \approx 0 it must be some 'constant' (constant), which we shall call k. The 'functional equation' is thereby rewritten as the 'differential equation'

f^{\prime}(x) = k \cdot f(x), \ f(0) = 1

. Its 'solution' is f(x) = e^{k \cdot x}, \ k \neq 0. Why k \neq 0? Because then f^{\prime}(x) = 0, and f(x) would turn out to be a 'constant function', which is just the 'trivial solution' mentioned above!!
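As a one-line check the excerpt leaves implicit, this solution does satisfy both the functional equation and the initial condition:

f(x) \cdot f(y) = e^{k \cdot x} \cdot e^{k \cdot y} = e^{k \cdot (x + y)} = f(x + y), \qquad f(0) = e^{0} = 1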

At first sight this 'method of solution' seems truly 'marvelous': the problem turned into a 'differential equation'! Yet examine it closely: f(x + y) = f(x) \cdot f(y) states a 'global property' of the 'function' f(x), while f(x + \delta x) = f(x) + f^{\prime}(x) \cdot \delta x + \epsilon \cdot \delta x describes the 'infinitesimal neighborhood' of f(x). Is it really 'legitimate' to consider the two together? If a function's 'global property' naturally 'entails' \Longrightarrow a 'description' of its 'infinitesimal neighborhood', then, the function being 'smooth', why should the 'differential equation' of the 'infinitesimal neighborhood' not 'integrate' (integrate) \Longrightarrow back into that 'global property'?

─── Excerpted from 《【Sonic π】電聲學之電路學《四》之《一》》