W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】六

此處 Michael Nielsen 先生談起『理論』上可行的方法,『實務』上可能會不管用!!??

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly. To see why it’s costly, suppose we want to compute all the second partial derivatives \partial^2 C/ \partial v_j \partial v_k. If there are a million such v_j variables then we’d need to compute something like a trillion (i.e., a million squared) second partial derivatives*

*Actually, more like half a trillion, since \partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j. Still, you get the point.

That’s going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we’ll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights w_k and biases b_l which minimize the cost in Equation (6). To see how this works, let’s restate the gradient descent update rule, with the weights and biases replacing the variables v_j. In other words, our “position” now has components w_k and b_l, and the gradient vector \nabla C has corresponding components \partial C / \partial w_k and \partial C / \partial b_l. Writing out the gradient descent update rule in terms of components, we have

w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k}  \ \ \ \ \ (16)
b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l} \ \ \ \ \ (17) .

 

By repeatedly applying this update rule we can “roll down the hill”, and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
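【筆者附註】(16)、(17) 兩式若以 Python 寫意,大致如下。此為示意速寫,非 Nielsen 原文之程式;偏導數假設由 grad_C_w、grad_C_b 兩個函數給出,並以一個玩具成本函數代替真正的神經網絡成本:

def gradient_descent_step(w, b, grad_C_w, grad_C_b, eta=0.1):
    # 依 (16)、(17) 式:v -> v' = v - eta * dC/dv
    w_new = w - eta * grad_C_w(w, b)
    b_new = b - eta * grad_C_b(w, b)
    return w_new, b_new

# 玩具例子:C(w, b) = w^2 + b^2(筆者假設之成本函數)
grad_C_w = lambda w, b: 2 * w
grad_C_b = lambda w, b: 2 * b

w, b = 3.0, -2.0
for _ in range(100):          # 反覆「滾下山坡」
    w, b = gradient_descent_step(w, b, grad_C_w, grad_C_b)
print(w, b)                   # 應趨近最小值 (0, 0)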

There are a number of challenges in applying the gradient descent rule. We’ll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let’s look back at the quadratic cost in Equation (6). Notice that this cost function has the form C = \frac{1}{n} \sum_x C_x, that is, it’s an average over costs C_x \equiv \frac{\|y(x)-a\|^2}{2} for individual training examples. In practice, to compute the gradient \nabla C we need to compute the gradients \nabla C_x separately for each training input, x, and then average them, \nabla C = \frac{1}{n} \sum_x \nabla C_x . Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient \nabla C by computing \nabla C_x for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient \nabla C, and this helps speed up gradient descent, and thus learning.

………
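【筆者附註】上段所言『以隨機小樣本估計 \nabla C』,可用下面這段 Python 小品體會。其中 grad_Cx 代表單筆訓練輸入之梯度函數,為筆者假設之名稱:

import random

def grad_C_full(grad_Cx, training_inputs):
    # 真實梯度:(1/n) * sum_x grad_Cx(x),n 很大時代價高昂
    n = len(training_inputs)
    return sum(grad_Cx(x) for x in training_inputs) / n

def grad_C_estimate(grad_Cx, training_inputs, m):
    # 隨機抽 m 筆(m << n)估計梯度
    sample = random.sample(training_inputs, m)
    return sum(grad_Cx(x) for x in sample) / m

# 玩具例子:令 grad_Cx(x) = x,真實梯度即全體平均
data = [random.gauss(5.0, 2.0) for _ in range(100000)]
print(grad_C_full(lambda x: x, data))            # ≈ 5
print(grad_C_estimate(lambda x: x, data, 100))   # 亦 ≈ 5,但快得多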

 

什麼是『隨機』呢?維基百科詞條這麼說︰

Stochastic

The term stochastic occurs in a wide variety of professional or academic fields to describe events or systems that are unpredictable due to the influence of a random variable. The word “stochastic” comes from the Greek word στόχος (stokhos, “aim”).

Researchers refer to physical systems in which they are uncertain about the values of parameters, measurements, expected input and disturbances as “stochastic systems”. In probability theory, a purely stochastic system is one whose state is randomly determined, having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely. In this regard, it can be classified as non-deterministic (i.e., “random”) so that the subsequent state of the system is determined probabilistically. Any system or process that must be analyzed using probability theory is stochastic at least in part.[1][2] Stochastic systems and processes play a fundamental role in mathematical models of phenomena in many fields of science, engineering, finance and economics.

───

 

或可嘗試進一步了解︰

Stochastic gradient descent

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

Iterative method

In stochastic (or “on-line”) gradient descent, the true gradient of Q(w) = \sum_{i=1}^n Q_i(w) is approximated by a gradient at a single example:

w := w - \eta \nabla Q_i(w).

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.

In pseudocode, stochastic gradient descent can be presented as follows:

 
  • Choose an initial vector of parameters w and learning rate \eta.
  • Repeat until an approximate minimum is obtained:
    • Randomly shuffle examples in the training set.
    • For i = 1, 2, ..., n, do:
      • w := w - \eta \nabla Q_i(w).
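【筆者附註】上列虛擬碼可直譯成如下 Python(非維基原文之程式;假設 Q_i 之梯度由 grad_Qi(w, i) 給出):

import random

def sgd(w, grad_Qi, n, eta=0.01, epochs=10):
    # 每一輪(pass)先洗牌以防循環,再對每個樣本 i 做 w := w - eta * grad Q_i(w)
    indices = list(range(n))
    for _ in range(epochs):
        random.shuffle(indices)
        for i in indices:
            w = w - eta * grad_Qi(w, i)
    return w

# 玩具例子:Q_i(w) = (w - x_i)^2 / 2,其梯度為 w - x_i,最小值在樣本平均附近
xs = [random.gauss(3.0, 1.0) for _ in range(1000)]
print(sgd(0.0, lambda w, i: w - xs[i], n=len(xs)))   # 應接近 3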

A compromise between computing the true gradient and the gradient at a single example, is to compute the gradient against more than one training example (called a “mini-batch”) at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.
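【筆者附註】「mini-batch」之所以能借助向量化程式庫加速,可由以下 numpy 速寫略見一斑。此處以線性最小平方問題為例,純屬筆者假設之示意:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))                      # 訓練輸入
w_true = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=10000)

w, eta, m = np.zeros(5), 0.1, 32                     # m 即 mini-batch 大小
for _ in range(2000):
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]                          # 取一個 mini-batch
    grad = Xb.T @ (Xb @ w - yb) / m                  # 整批一次算,不必逐筆迴圈
    w -= eta * grad
print(w)                                             # 應接近 w_true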

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates \eta decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.[3] [4] This is in fact a consequence of the Robbins-Siegmund theorem.[5]

【圖】Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.

───

 

或可藉『統計力學』揣想它之『合理性』可能有根源耶??!!

假使我們思考這樣的一個『問題』︰

一個由大量粒子構成的『物理系統』,這些粒子具有某一個『物理過程』描述的『隨機變數』X_i, i=1 \cdots N,那麼在 t 時刻,這個『隨機變數』的『大數平均值』

\frac{1}{N} \sum \limits_{i=1}^{N} P[X = X_i] \cdot X_i

,是這個『物理系統』由大量粒子表現的『瞬時圖像』,也就是『統計力學』上所說的『系綜平均』ensemble average 值。再從一個『典型粒子』的『隨機運動』上講,這個『隨機變數』X_{this} (t_i), i = 1 \cdots N 會在不同時刻隨機的取值,因此就可以得到此一個『典型粒子』之『隨機變數』的『時間平均值』

\frac{1}{N} \sum \limits_{i=1}^{N} P[t = t_i] \cdot X_{this}(t_i)

,這說明了此一『典型粒子』在『物理系統』中的『歷時現象』 ,那麼此兩種平均值,它們的大小是一樣的嗎??

在『德汝德模型』中我們已經知道 P_{nc}(t) = e^{- t / \tau} 是一個『電子』於 t 時距裡不發生碰撞的機率。這樣 P_{nc}(t) - P_{nc}(t+dt) 的意思就是,在 t 與 t+dt 之間的時間點發生碰撞的機率。參考指數函數 e 的『泰勒展開式』

e^x = \sum \limits_{k=0}^{\infty} \frac{x^k}{k!}

,如此

P_{nc}(t) - P_{nc}(t+dt) = e^{- \frac {t}{ \tau}} - e^{- \frac {t+dt}{ \tau}} = e^{- \frac {t}{ \tau}} \left[ 1 -  e^{- \frac {dt}{ \tau}} \right]

\approx  e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau}

,這倒過來說明了為什麼在『德汝德模型』中,發生碰撞的機率是 \frac{dt}{\tau},於是一個有 N 個『自由電子』的導體,在 t+dt 時刻可能有 N \cdot e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau} 個電子發生碰撞,碰撞『平均時距』的『系綜平均』是

\frac{1}{N} \int_{0}^{\infty} t \cdot N \cdot e^{- \frac {t}{ \tau}} \cdot \frac{dt}{\tau} = \tau

。比之於《【Sonic π】電路學之補充《一》》一文中之電子的『時間平均值』,果然這兩者相等。事實上一般物理系統要是處於統計力學所說的『平衡狀態』,這兩種『平均值』都會是『相等』的。當真是『考典範以歷史』與『察大眾於一時』都能得到相同結論的嗎??

─── 摘自《【Sonic π】電路學之補充《二》
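【筆者附註】『系綜平均』與『時間平均』相等一事,亦可用 numpy 做個小小模擬體會(以指數分布之碰撞時距為例,純屬筆者示意):

import numpy as np

rng = np.random.default_rng(1)
tau = 2.0
N = 100000

# 系綜平均:同一時刻觀看 N 個電子,各自距下次碰撞之時距
ensemble = rng.exponential(tau, size=N)
print(ensemble.mean())        # ≈ tau

# 時間平均:跟著一個『典型電子』記錄它連續 N 次碰撞的時距
one_electron = rng.exponential(tau, size=N)
print(one_electron.mean())    # 亦 ≈ tau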

W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】五

Michael Nielsen 先生用了一大段文字解釋

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we’ll need is a data set to learn from – a so-called training data set. We’ll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST’s name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States’ National Institute of Standards and Technology. Here’s a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we’ll ask it to recognize images which aren’t in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We’ll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn’t see during training.

We’ll use the notation x to denote a training input. It’ll be convenient to regard each training input x as a 28 \times 28 = 784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We’ll denote the corresponding desired output by y=y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
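【筆者附註】文中 y(x) 之十維向量,用 Python 寫不過如此(筆者之示意):

import numpy as np

def desired_output(digit):
    # 把 0~9 的標籤轉成 10 維行向量,例如 6 -> (0,0,0,0,0,0,1,0,0,0)^T
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(desired_output(6).T)    # [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]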

What we’d like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we’re achieving this goal we define a cost function*

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it’s often used in research papers and other discussions of neural networks. :

C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 . \ \ \ \ \ (6)


Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven’t explicitly indicated this dependence. The notation \| v \| just denotes the usual length function for a vector v. We’ll call C the quadratic cost function; it’s also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that C(w,b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w,b) becomes small, i.e., C(w,b) \approx 0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w,b) \approx 0. By contrast, it’s not doing so well when C(w,b) is large – that would mean that y(x) is not close to the output a for a large number of inputs. So the aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We’ll do that using an algorithm known as gradient descent.
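【筆者附註】二次成本函數寫成程式,就是這樣一段小函數(假設 network_output(x) 回傳網絡對輸入 x 之輸出向量 a,名稱為筆者所設):

import numpy as np

def quadratic_cost(network_output, training_data):
    # C(w, b) = (1/2n) * sum_x || y(x) - a ||^2
    n = len(training_data)
    total = sum(np.linalg.norm(y - network_output(x)) ** 2
                for x, y in training_data)
    return total / (2.0 * n)

# 玩具例子:輸出恆為零向量之『網絡』
data = [(np.ones(3), np.zeros(3)), (np.zeros(3), np.ones(3))]
print(quadratic_cost(lambda x: np.zeros(3), data))   # (0 + 3) / (2*2) = 0.75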

……

Indeed, there’s even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let’s suppose that we’re trying to make a move \Delta v in position so as to decrease C as much as possible. This is equivalent to minimizing \Delta C \approx \nabla C \cdot \Delta v. We’ll constrain the size of the move so that \| \Delta v \| = \epsilon for some small fixed \epsilon >0. In other words, we want a move that is a small step of a fixed size, and we’re trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of \Delta v which minimizes \nabla C \cdot \Delta v is \Delta v = - \eta \nabla C, where \eta = \epsilon / \|\nabla C\| is determined by the size constraint \|\Delta v\| = \epsilon . So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.

───
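【筆者附註】為何 \Delta v = - \eta \nabla C 是『最佳』選擇?其實是 Cauchy-Schwarz 不等式的直接結果,補記於此:

\nabla C \cdot \Delta v \ \geq \ - \| \nabla C \| \, \| \Delta v \| = - \epsilon \| \nabla C \|

,等號成立當且僅當 \Delta v 與 - \nabla C 同向,亦即

\Delta v = - \epsilon \frac{\nabla C}{\| \nabla C \|} = - \eta \nabla C , \quad \eta = \frac{\epsilon}{\| \nabla C \|}

。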

 

或許這麼費勁的原因,是希望讀者能夠直覺掌握

梯度下降法

梯度下降法(英語:Gradient descent)是一個最佳化算法,通常也稱為最速下降法。

描述

梯度下降法,基於這樣的觀察:如果實值函數F(\mathbf{x})在點\mathbf{a}可微且有定義,那麼函數F(\mathbf{x})\mathbf{a}點沿著梯度相反的方向 -\nabla F(\mathbf{a}) 下降最快。

因而,如果

\mathbf{b}=\mathbf{a}-\gamma\nabla F(\mathbf{a})

對於\gamma>0為一個夠小數值時成立,那麼F(\mathbf{a})\geq F(\mathbf{b})

考慮到這一點,我們可以從函數F的局部極小值的初始估計\mathbf{x}_0出發,並考慮如下序列 \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots使得

\mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n),\ n \ge 0

因此可得到

F(\mathbf{x}_0)\ge F(\mathbf{x}_1)\ge F(\mathbf{x}_2)\ge \cdots,

如果順利的話序列(\mathbf{x}_n)收斂到期望的極值。注意每次疊代步長\gamma可以改變。

右側的圖片示例了這一過程,這裡假設F定義在平面上,並且函數圖像是一個碗形。藍色的曲線是等高線(水平集),即函數F為常數的集合構成的曲線。紅色的箭頭指向該點梯度的反方向。(一點處的梯度方向與通過該點的等高線垂直)。沿著梯度下降方向,將最終到達碗底,即函數F值最小的點。

【圖】梯度下降法的描述。

例子

梯度下降法處理一些複雜的非線性函數會出現問題,例如Rosenbrock函數

f(x, y) = (1-x)^2 + 100(y-x^2)^2

其最小值在(x, y)=(1, 1)處,數值為f(x, y)=0。但是此函數具有狹窄彎曲的山谷,最小值(x, y)=(1, 1)就在這些山谷之中,並且谷底很平。優化過程是之字形的向極小值點靠近,速度非常緩慢。

【圖】Rosenbrock 函數上最速下降法之字形緩慢收斂的過程。

───
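【筆者附註】維基百科所述之字形慢速收斂,可用一小段 numpy 實驗觀察(學習率須取得很小,否則容易發散;此為筆者自加之示意):

import numpy as np

def rosenbrock_grad(p):
    x, y = p
    # f(x, y) = (1-x)^2 + 100 (y - x^2)^2 之梯度
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return np.array([dfdx, dfdy])

p = np.array([-1.0, 1.0])
eta = 0.001
for _ in range(200000):
    p = p - eta * rosenbrock_grad(p)
print(p)     # 迭代二十萬次後,仍只是緩慢地逼近最小值 (1, 1)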

 

然而『跨學科』之類比︰

Okay, so calculus doesn’t work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn’t be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We’d randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C – those derivatives would tell us everything we need to know about the local “shape” of the valley, and therefore how our ball should roll.

Based on what I’ve just written, you might suppose that we’ll be trying to write down Newton’s equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we’re not going to take the ball-rolling analogy quite that seriously – we’re devising an algorithm to minimize C, not developing an accurate simulation of the laws of physics! The ball’s-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let’s simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

───

 

是否可以促進『理解』?實有賴於『STEM』教育之落實!

Science, Technology, Engineering and Mathematics (STEM, previously SMET) is an education grouping used worldwide. The acronym refers to the academic disciplines of science[note 1], technology, engineering and mathematics.[1] The term is typically used when addressing education policy and curriculum choices in schools to improve competitiveness in science and technology development. It has implications for workforce development, national security concerns and immigration policy.[1] The acronym arose in common use shortly after an interagency meeting on science education held at the US National Science Foundation chaired by the then NSF director Rita Colwell. A director from the Office of Science division of Workforce Development for Teachers and Scientists, Dr. Peter Faletra, suggested the change from the older acronym SMET to STEM. Dr. Colwell, expressing some dislike for the older acronym, responded by suggesting NSF to institute the change. One of the first NSF projects to use the acronym was STEMTEC, the Science, Technology, Engineering and Math Teacher Education Collaborative at the University of Massachusetts Amherst, which was funded in 1997.

───

 

比方說,在《踏雪尋梅!!》一文中,我們談過︰

那麼科學上如何看待『預言』的呢?比方講一七四四年瑞士大數學家和物理學家萊昂哈德‧歐拉 Leonhard Euler 在《尋找具有極大值或極小值性質的曲線,等周問題的最廣義解答》 Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes, sive solutio problematis isoperimetrici lattissimo sensu accepti 論文中,非常清晰明白的給出『最小作用量原理』的定義

假使一個質量為 M,速度為 v 的粒子移動無窮小距離 ds 時。這時粒子的動量為 M \cdot v,當乘以此無窮小距離 ds 後,給出 M \cdot v \ ds ,這是粒子的動量作用於無窮小『路徑』ds 距離之上。我宣稱︰在所有連結『始終』兩個端點的可能『路徑』之中,這個粒子運動的真實『軌跡』是 \int_{initial}^{final}  M \cdot v \ ds 為最小值的『路徑』;如果假定質量是個常數,也就是 \int_{initial}^{final}  v \ ds 為最小值的『軌道』。

也就是說,在所有連結『始終』兩個端點的可能『路徑』path 之中, 粒子所選擇的『路徑』是『作用量』A = \int_{path}  M \cdot v \ ds 泛函數的『極值』,這是牛頓第二運動定律的『變分法』Variation 描述。如果從今天物理能量的觀點來看 A = \int_{path}  M \cdot v \ ds = \int_{path}  M \cdot v \ \frac {ds}{dt} dt = \int_{path}  M \cdot v^2 dt = 2 \int_{path} T dt,此處 T = \frac{1}{2} M v^2 就是粒子的動能。因為牛頓第二運動定律可以表述為 F = M \cdot a = \frac {d P}{dt}, \ P = M \cdot v,所以 \int_{path}  \frac {d P}{dt} ds = \int_{path}  \frac {d s}{dt} dP = \int_{path}  v dP  = \Delta T = \int_{path}  F ds。

假使粒子所受的力是『保守力』conservative force,也就是講此力沿著任何路徑所作的『功』work 只跟粒子『始終』兩個端點的『位置』有關,與它行經的『路徑』無關。在物理上這時通常將它定義成這個『力場』的『位能』V = - \int_{ref}^{position}  F ds,於是如果一個粒子在一個保守場中,\Delta T + \Delta V = 0,這就是物理上『能量守恆』原理!舉例來說重力、彈簧力、電場力等等,都是保守力,然而摩擦力和空氣阻力種種都是典型的非保守力。由於 \Delta V 在這些可能路徑裡都不變,因此『最小作用量原理』所確定的『路徑』也就是『作用量』A 的『極值』。一七八八年法國籍義大利裔數學家和天文學家約瑟夫‧拉格朗日 Joseph Lagrange 對於變分法發展貢獻很大,最早在其論文《分析力學》Mecanique Analytique 裡,使用『能量守恆定律』推導出了歐拉陳述的最小作用量原理的正確性。

從數學上講運動的『微分方程式』等效於對應的『積分方程式』,這本不是什麼奇怪的事,當人們開始考察它的『哲學意義』,可就引發很多不同的觀點。有人說 F = m a 就像『結果 \propto 原因』描繪『因果』的『瞬刻聯繫』關係,這是一種『決定論』,從一個『時空點』推及『無窮小時距dt 接續的另一個『時空點』,因此一旦知道『初始狀態』,就已經確定了它的『最終結局』!有人講 A = \int_{initial}^{final}  M \cdot v \ ds 彷彿確定了『目的地』無論從哪個『起始處』出發,總會有一個『通達路徑』,這成了一種『目的論』,大自然自會找到『此時此處』通向『彼時彼處』的『道路』!!各種意義『詮釋』果真耶?宛如說『花開自有因,將要為誰妍』??

───

 

那麼一個已經明白

位能

位能的保守力定義

如果分別作用於兩個質點上的作用力與反作用力作功與具體路徑無關,只取決於交互作用質點初末位置,那麼這樣的一對力就叫作保守力。不滿足這個條件的則稱為非保守力。可以證明保守場的幾個等價條件[1],於是我們得到保守力的性質有:

  1. 保守力沿給定兩點間作功與路徑無關;
  2. 保守力沿任意環路作功為零;
  3. 保守力可以表示為一個純量函數的(負)梯度

推廣到多質點體系和連續分布物體,如果一封閉系統中任意兩個質點之間的作用力都是保守力,則稱該系統為保守體系。保守體系的位形,即在保守體系中各質點的相對位置發生變化時,其間的交互作用力作功,作功之和只與各質點相對位置有關。將保守體系在保守力作用下的這種與相對位置相聯繫的作功的能力定義為一個函數,稱為該保守體系的勢能函數位能函數,簡稱勢能位能[2]。這樣,體系從一種位形變為另一種位形時對外界所作的功等於後者與前者的位能之差,從而賦予了位能函數以直觀的物理意義

除此之外,我們還可以將位能的定義從現在的基礎上拓展。比如熱學中氣體分子間的交互作用位能,它是大量分子位能的和,實際不是用相對位置(位形)來描述的,而是用體積、溫度、壓強等熱學參量。又如,在一些特定的約束條件下,某些平時是非保守力的力也成為了保守力[3],或者幾種力的淨力恰巧成為了一個保守力。

───

 

以及

梯度定理

梯度定理(英語:gradient theorem),也叫線積分基本定理,是說純量場梯度沿曲線的積分可用純量場在該曲線兩端的值之差來計算。

設函數 \varphi : U \subseteq \mathbb{R}^n \to \mathbb{R},則

 \varphi\left(\mathbf{q}\right)-\varphi\left(\mathbf{p}\right) = \int_{\gamma[\mathbf{p},\,\mathbf{q}]} \nabla\varphi(\mathbf{r})\cdot d\mathbf{r}.

梯度定理把微積分基本定理從直線數軸推廣到平面、空間,乃至一般的n維空間中的曲線。

梯度定理表明梯度場的曲線積分是路徑無關的,這是物理學中「保守力」的定義方式之一。如果 \varphi 是位勢,則 \nabla\varphi 就是保守向量場。上面的公式表明:保守力做功只和物體運動路徑的端點有關,而與路徑本身無關。

梯度定理有個逆定理,是說任何路徑無關的向量場都可以表示為某個純量場的梯度。這個逆定理和原定理一樣在純粹和應用數學中有很多推論和應用。

───
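【筆者附註】梯度定理之『路徑無關』,可以做個小小數值檢驗(筆者所加;取 \varphi(x, y) = x^2 + y^2,沿兩條不同路徑由 p = (0, 0) 積到 q = (3, 4)):

import numpy as np

def line_integral_of_grad(path, n=100000):
    # 數值計算 ∫ ∇φ(r)·dr,其中 φ(x, y) = x^2 + y^2,∇φ = (2x, 2y)
    t = np.linspace(0.0, 1.0, n)
    x, y = path(t)
    dx, dy = np.diff(x), np.diff(y)
    xm, ym = (x[:-1] + x[1:]) / 2, (y[:-1] + y[1:]) / 2    # 中點法
    return np.sum(2 * xm * dx + 2 * ym * dy)

straight = lambda t: (3 * t, 4 * t)                        # 直線路徑
wiggly = lambda t: (3 * t, 4 * t + np.sin(5 * np.pi * t))  # 彎曲路徑,端點相同
print(line_integral_of_grad(straight))   # ≈ φ(q) - φ(p) = 25
print(line_integral_of_grad(wiggly))     # 亦 ≈ 25,與路徑無關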

 

基本概念者,豈不了解『負梯度』其實源自自然的耶!!??

W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】四

由於 Michael Nielsen 先生此處談及之章節,淺顯易明︰

A simple network to classify handwritten digits

Having defined neural networks, let’s return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we’d like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we’d like to break the image

into six separate images,

We humans solve this segmentation problem with ease, but it’s challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we’d like our program to recognize that the first digit above,

is a 5.

We’ll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it’s probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we’ll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains 784=28×28 neurons. For simplicity I’ve omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

───
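【筆者附註】把 28×28 灰階影像餵給輸入層,實作上不過是『攤平』與『縮放到 0.0~1.0』兩件事,約略如下(筆者之示意,假設原始像素值為 0~255):

import numpy as np

def to_input_vector(image_28x28):
    # 攤平成 784×1 之行向量,0.0 代表白、1.0 代表黑
    x = np.asarray(image_28x28, dtype=float).reshape(784, 1)
    return x / 255.0

demo = np.zeros((28, 28), dtype=int)
demo[10:18, 10:18] = 255                  # 畫一個黑方塊當測試影像
print(to_input_vector(demo).shape)        # (784, 1)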

 

於是心想何不趁此機會,介紹讀者一本理論性『神經網絡』的好書

Neural Networks

Neural Networks – A Systematic Introduction

Raúl Rojas,  Springer-Verlag, Berlin, 1996, 502 S. (two editions)

Book Description

Neural networks are a computing paradigm that is finding increasing attention among computer scientists. In this book, theoretical laws and models previously scattered in the literature are brought together into a general theory of artificial neural nets. Always with a view to biology and starting with the simplest nets, it is shown how the properties of models change when more general computing elements and net topologies are introduced. Each chapter contains examples, numerous illustrations, and a bibliography. The book is aimed at readers who seek an overview of the field or who wish to deepen their knowledge. It is suitable as a basis for university courses in neurocomputing.


 

希望十年前之慧劍,仍能在今日開疆闢土耶??

Neural Networks – A Systematic Introduction

a book by Raul Rojas

Foreword by Jerome Feldman

Springer-Verlag, Berlin, New-York, 1996 (502 p.,350 illustrations).


 


Whole Book (PDF)

Review in “Computer Reviews”

Reported errata

 

切莫祇淺嚐即止乎!!

One and Two Layered Networks

6.1 Structure and geometric visualization
In the previous chapters the computational properties of isolated threshold units have been analyzed extensively. The next step is to combine these elements and look at the increased computational power of the network. In this chapter we consider feed-forward networks structured in successive layers of computing units.

6.1.1 Network architecture
The networks we want to consider must be defined in a more precise way in terms of their architecture. The atomic elements of any architecture are the computing units and their interconnections. Each computing unit collects the information from n input lines with an integration function \Psi : R^n \longrightarrow R . The total excitation computed in this way is then evaluated using an activation function \Phi : R \longrightarrow R. In perceptrons the integration function is the sum of the inputs. The activation (also called output function) compares the sum with a threshold. Later we will generalize \Phi to produce all values between 0 and 1. In the case of \Psi some functions other than addition can also be considered [454], [259]. In this case the networks can compute some difficult functions with fewer computing units.

Definition 9. A network architecture is a tuple (I, N, O, E) consisting of a set I of input sites, a set N of computing units, a set O of output sites and a set E of weighted directed edges. A directed edge is a tuple (u, v, w) whereby u \in I \cup N , v \in N \cup O and w \in R.

The input sites are just entry points for information into the network and do not perform any computation. Results are transmitted to the output sites. The set N consists of all computing elements in the network. Note that the edges between all computing units are weighted, as are the edges between input and output sites and computing units.

………
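【筆者附註】Rojas 書中 Definition 9 之 (I, N, O, E) 四元組,若照章以 Python 資料結構表示,大致如此(非原書程式,純屬筆者速寫):

from dataclasses import dataclass, field

@dataclass
class NetworkArchitecture:
    # Definition 9:輸入點 I、計算單元 N、輸出點 O、加權有向邊 E
    I: set = field(default_factory=set)
    N: set = field(default_factory=set)
    O: set = field(default_factory=set)
    E: list = field(default_factory=list)   # 每條邊為 (u, v, w),u ∈ I∪N,v ∈ N∪O,w ∈ R

net = NetworkArchitecture(
    I={'x1', 'x2'},
    N={'h1', 'y1'},
    O={'o1'},
    E=[('x1', 'h1', 0.5), ('x2', 'h1', -1.2), ('h1', 'y1', 2.0), ('y1', 'o1', 1.0)],
)
print(len(net.E))    # 4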

W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】三

在開始探討『神經網絡』手寫阿拉伯數字辨識之前, Michael Nielsen 先生先介紹了它的主要『架構』以及使用之『術語』︰

The architecture of neural networks

In the next section I’ll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term “hidden” perhaps sounds a little mysterious – the first time I heard the term I thought it must have some deep philosophical or mathematical significance – but it really means nothing more than “not an input or an output”. The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I’m not going to use the MLP terminology in this book, since I think it’s confusing, but wanted to warn you of its existence.

───

 

如果從

圖 (數學)

數學上,一個圖(Graph)是表示物件與物件之間的關係的方法,是圖論的基本研究對象。一個圖看起來是由一些小圓點(稱為頂點或結點)和連結這些圓點的直線或曲線(稱為邊)組成的。

───

 

之『拓樸』觀點來看,所談『神經網絡』之『連接性』相當簡單。當真是比『知識網』,甚或『捷運網』都還容易︰

《 Simply Logical
Intelligent Reasoning by Example 》

之第三章開始處, Peter Flach 說︰

3 Logic Programming and Prolog
In the previous chapters we have seen how logic can be used to represent knowledge about a particular domain, and to derive new knowledge by means of logical inference. A distinct feature of logical reasoning is the separation between model theory and proof theory: a set of logical formulas determines the set of its models, but also the set of formulas that can be derived by applying inference rules. Another way to say the same thing is: logical formulas have both a declarative meaning and a procedural meaning. For instance, declaratively the order of the atoms in the body of a clause is irrelevant, but procedurally it may determine the order in which different answers to a query are found.

Because of this procedural meaning of logical formulas, logic can be used as a programming language. If we want to solve a problem in a particular domain, we write down the required knowledge and apply the inference rules built into the logic programming language. Declaratively, this knowledge specifies what the problem is, rather than how it should be solved. The distinction between declarative and procedural aspects of problem solving is succinctly expressed by Kowalski’s equation

algorithm = logic + control

Here, logic refers to declarative knowledge, and control refers to procedural knowledge. The equation expresses that both components are needed to solve a problem algorithmically.

In a purely declarative programming language, the programmer would have no means to express procedural knowledge, because logically equivalent programs would behave identical. However, Prolog is not a purely declarative language, and therefore the procedural meaning of Prolog programs cannot be ignored. For instance, the order of the literals in the body of a clause usually influences the efficiency of the program to a large degree. Similarly, the order of clauses in a program often determines whether a program will give an answer at all. Therefore, in this chapter we will take a closer look at Prolog’s inference engine and its built-in features (some of which are non-declarative). Also, we will discuss some common programming techniques.

就讓我們舉個典型例子 ── 『台北捷運網』一小部份 ──,講講『陳述地』  declaratively 以及『程序地』procedurally 的『意義』不同,如何展現在程式『思考』和『寫作』上。

在這個例子裡,我們將以『忠孝新生』站為中心,含括了二十五個捷運站,隨意不依次序給定站名如下︰

松江南京大安森林公園、善導寺、南京復興、忠孝復興  
台北小巨蛋、小南門、忠孝新生、中山、台北車站  
龍山寺、忠孝敦化、西門、雙連、中山國中  
信義安和、中正紀念堂、古亭、大安、行天宮  
東門、台大醫院、北門、科技大樓 、世貿台北101

【例子捷運圖】

台北捷運

假使我們定義

連接( □, ○) 代表 □ 捷運站,僅經過 □ 站,直通 ○ 捷運站,方向是 □ → ○。如此我們可以把這部份『捷運網』,表示為︰

pi@raspberrypi ~ $ python3
Python 3.2.3 (default, Mar  1 2013, 11:53:50)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

# 定義站站間『連接』事實
>>> from pyDatalog import pyDatalog
>>> pyDatalog.create_terms('連接, 鄰近, 能達, 所有路徑')
>>> +連接('台北車站', '中山')
>>> +連接('台北車站', '善導寺')
>>> +連接('台北車站', '西門')
>>> +連接('台北車站', '台大醫院')
>>> +連接('西門', '龍山寺')
>>> +連接('西門', '小南門')
>>> +連接('西門', '北門')
>>> +連接('中山', '北門')
>>> +連接('中山', '雙連')
>>> +連接('中山', '松江南京')
>>> +連接('中正紀念堂', '小南門')
>>> +連接('中正紀念堂', '台大醫院')
>>> +連接('中正紀念堂', '東門')
>>> +連接('中正紀念堂', '古亭')
>>> +連接('東門', '古亭')
>>> +連接('東門', '忠孝新生')
>>> +連接('東門', '大安森林公園')
>>> +連接('善導寺', '忠孝新生')
>>> +連接('松江南京', '忠孝新生')
>>> +連接('忠孝復興', '忠孝新生')
>>> +連接('松江南京', '行天宮')
>>> +連接('松江南京', '南京復興')
>>> +連接('中山國中', '南京復興')
>>> +連接('台北小巨蛋', '南京復興')
>>> +連接('忠孝復興', '南京復興')
>>> +連接('忠孝復興', '忠孝敦化')
>>> +連接('忠孝復興', '大安')
>>> +連接('大安森林公園', '大安')
>>> +連接('科技大樓', '大安')
>>> +連接('信義安和', '大安')
>>> +連接('信義安和', '世貿台北101')
>>>

此處『連接』次序是『隨興』輸入的,一因『網路圖』沒有個經典次序,次因『事實』『陳述』不會因次序而改變,再因『程序』上『pyDatalog』對此『事實』『次序』也並不要求。由於『連接』之次序有『起止』方向性,上面的陳述並不能代表那個『捷運網』,這可以從下面程式片段得知。【※ 在 pyDatalog 中,沒有變元的『查詢』 ask or query ,以輸出『set([()])』表示一個存在的事實,以輸出『None』表達所查詢的不是個事實。】

# 單向性
>>> pyDatalog.ask("連接('信義安和', '世貿台北101')") == set([()])
True
>>> pyDatalog.ask("連接('世貿台北101', '信義安和')") == set([()])
False
>>> pyDatalog.ask("連接('世貿台北101', '信義安和')") == None
True
>>>

所以我們必須給定『連接( □, ○)』是具有『雙向性』的,也就是

連接(X站名, Y站名) <= 連接(Y站名, X站名)

,這樣的『規則』 Rule 。由於 pyDatalog 的『語詞』 Term 使用前都必須『宣告』,而且『變元』必須『大寫開頭』,因此我們得用

pyDatalog.create_terms('X站名, Y站名, Z站名, P路徑甲, P路徑乙')

這樣的『陳述句』 Statement。【※ 中文沒有大小寫,也許全部被當成了小寫,所以變元不得不以英文大寫起頭。】

─── 摘自《勇闖新世界︰ 《 pyDatalog 》 導引《七》》

 

假使細思 z_j = \sum_i w_{ji} \cdot x_i + b_j 表達式,或許自可發現『 S 神經元』網絡計算與『矩陣』數學密切關聯︰

Matrix (mathematics)

Definition

A matrix is a rectangular array of numbers or other mathematical objects for which operations such as addition and multiplication are defined.[6] Most commonly, a matrix over a field F is a rectangular array of scalars each of which is a member of F.[7][8] Most of this article focuses on real and complex matrices, that is, matrices whose elements are real numbers or complex numbers, respectively. More general types of entries are discussed below. For instance, this is a real matrix:

\mathbf{A} = \begin{bmatrix} -1.3 & 0.6 \\ 20.4 & 5.5 \\ 9.7 & -6.2 \end{bmatrix}.

The numbers, symbols or expressions in the matrix are called its entries or its elements. The horizontal and vertical lines of entries in a matrix are called rows and columns, respectively.

Size

The size of a matrix is defined by the number of rows and columns that it contains. A matrix with m rows and n columns is called an m × n matrix or m-by-n matrix, while m and n are called its dimensions. For example, the matrix A above is a 3 × 2 matrix.

Matrices which have a single row are called row vectors, and those which have a single column are called column vectors. A matrix which has the same number of rows and columns is called a square matrix. A matrix with an infinite number of rows or columns (or both) is called an infinite matrix. In some contexts, such as computer algebra programs, it is useful to consider a matrix with no rows or no columns, called an empty matrix.

Name | Size | Example | Description
Row vector | 1 × n | \begin{bmatrix}3 & 7 & 2 \end{bmatrix} | A matrix with one row, sometimes used to represent a vector
Column vector | n × 1 | \begin{bmatrix}4 \\ 1 \\ 8 \end{bmatrix} | A matrix with one column, sometimes used to represent a vector
Square matrix | n × n | \begin{bmatrix} 9 & 13 & 5 \\ 1 & 11 & 7 \\ 2 & 6 & 3 \end{bmatrix} | A matrix with the same number of rows and columns, sometimes used to represent a linear transformation from a vector space to itself, such as reflection, rotation, or shearing.

……

Linear equations

Matrices can be used to compactly write and work with multiple linear equations, that is, systems of linear equations. For example, if A is an m-by-n matrix, x designates a column vector (that is, n×1-matrix) of n variables x_1, x_2, \ldots, x_n, and b is an m×1-column vector, then the matrix equation

Ax = b

is equivalent to the system of linear equations

A_{1,1} x_1 + A_{1,2} x_2 + \cdots + A_{1,n} x_n = b_1
\vdots
A_{m,1} x_1 + A_{m,2} x_2 + \cdots + A_{m,n} x_n = b_m .[24]
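【筆者附註】Ax = b 與線性方程組之等價,用 numpy 寫來一目了然(筆者附加之小例):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # m×n 矩陣(此處 m = n = 2)
b = np.array([5.0, 10.0])

x = np.linalg.solve(A, b)            # 解 Ax = b
print(x)                             # [1. 3.]
print(A @ x)                         # 逐列即 2·x1 + 1·x2 = 5,1·x1 + 3·x2 = 10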

……

Relationship to linear maps

Linear maps R^n \to R^m are equivalent to m-by-n matrices, as described above. More generally, any linear map f: V \to W between finite-dimensional vector spaces can be described by a matrix A = (a_{ij}), after choosing bases v_1, \ldots, v_n of V, and w_1, \ldots, w_m of W (so n is the dimension of V and m is the dimension of W), which is such that

f(\mathbf{v}_j) = \sum_{i=1}^m a_{i,j} \mathbf{w}_i\qquad\mbox{for }j=1,\ldots,n.

In other words, column j of A expresses the image of v_j in terms of the basis vectors w_i of W; thus this relation uniquely determines the entries of the matrix A. Note that the matrix depends on the choice of the bases: different choices of bases give rise to different, but equivalent matrices.[60] Many of the above concrete notions can be reinterpreted in this light, for example, the transpose matrix A^T describes the transpose of the linear map given by A, with respect to the dual bases.[61]

These properties can be restated in a more natural way: the category of all matrices with entries in a field k with multiplication as composition is equivalent to the category of finite dimensional vector spaces and linear maps over this field.

More generally, the set of m×n matrices can be used to represent the R-linear maps between the free modules R^m and R^n for an arbitrary ring R with unity. When n = m composition of these maps is possible, and this gives rise to the matrix ring of n×n matrices representing the endomorphism ring of R^n.

───

 

何不趁此機會複習或學習一下的耶!!
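【筆者附註】回到 z_j = \sum_i w_{ji} \cdot x_i + b_j ︰一整層『 S 神經元』的計算,其實就是一次矩陣乘法加上逐元素的 \sigma,速寫如下(筆者之示意):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_output(W, b, x):
    # z = W x + b,再逐元素取 σ;W 之第 j 列即第 j 個神經元的權重 w_{ji}
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # 本層 3 個神經元,各接上一層 4 個輸入
b = rng.normal(size=(3, 1))
x = rng.normal(size=(4, 1))
print(layer_output(W, b, x))     # 3×1 之輸出向量,值皆落於 (0, 1)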

W!o+ 的《小伶鼬工坊演義》︰神經網絡【Sigmoid】二

雖然 Michael Nielsen 先生嘗試藉著最少的數學來談『 S 神經元』︰

What about the algebraic form of \sigma? How can we understand that? In fact, the exact form of \sigma isn’t so important – what really matters is the shape of the function when plotted. Here’s the shape:

【圖】σ 函數之形狀(logistic 曲線)。

This shape is a smoothed out version of a step function:

【圖】單位階躍函數。

If \sigma had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether w \cdot x+b was positive or negative*

*Actually, when w \cdot x+b = 0 the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we’d need to modify the step function at that one point. But you get the idea.

. By using the actual \sigma function we get, as already implied above, a smoothed out perceptron. Indeed, it’s the smoothness of the \sigma function that is the crucial fact, not its detailed form. The smoothness of \sigma means that small changes \Delta w_j in the weights and \Delta b in the bias will produce a small change \Delta output in the output from the neuron. In fact, calculus tells us that \Delta output is well approximated by

\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \ \ \ \ \ (5)

where the sum is over all the weights, w_j , and \partial \, \mbox{output} / \partial w_j  and \partial \, \mbox{output} /\partial b denote partial derivatives of the output with respect to w_j and b, respectively. Don’t panic if you’re not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it’s actually saying something very simple (and which is very good news): \Delta output is a linear function of the changes \Delta w_j and \Delta b in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

───

 

假使能夠深入了解背後的數學,或許會更容易掌握的吧!何不開卷閱讀『無窮小』系列文本,從直覺觀點學習微積分的呢!!

在《吃一節 TCPIP!!中》一文中我們談到了『拓撲學』 Topology 一詞源自希臘文『地點之研究』,始於歐拉柯尼斯堡的七橋問題。這門數學探討『連通性』 connectedness 、『連續性』 continuity 、以及『邊界』 boundary。它不用東西的『形狀』來作分類,而是分析在那個東西裡所有連通的點,各個連續的區域,和有哪些分別內外的邊界。假使從『拓撲學』的觀點來談『函數』的『連續性』,那麼 |f(x) - f(x_0)| < \varepsilon 就是 f(x_0) 的『鄰域』 neighborhood,而 |x-x_0| < \delta 也就是 x_0 的『鄰域』 。所以函數上『一點』的連續性是說『這個點』的所有『指定鄰域』,都有一個『實數區間』── 鄰域的另一種說法 ── 與之『對應』,『此函數』將『此區間』映射到那個『指定鄰域』裡。

然而一個函數在『某個點』的『連續性』,並不能夠『確保』在『那個點』的『斜率存在』 ── 或說『可微分性』,比方說

f(x) = \begin{cases}x & \mbox{if }x \ge 0, \\ 0 &\mbox{if }x < 0\end{cases}

,當 x > 0 時,『斜率』是 f^{'}(x) = \frac{df(x)}{dx} = 1,在 x < 0 時,『斜率』為 0,然而 x = 0 時『斜率』不存在!這使得我們必須研究一個函數在『每個點』之『鄰域』情況,於是數學步入了『解析的』 Analytic 時代。所謂『解析的』一詞是指『這類函數』在 x = x_0 的『鄰域』,可以用『泰勒級數』來作展開

T(x) = \sum \limits_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x-x_0)^{n}

。於是一個『解析函數』在定義域的『每一點』上都是『無窮階可導』的。人們將『無窮階可導』的函數,又稱之為『光滑函數』 smooth function。然而『可導性』卻不同於『連續性』,因此又定義了『連續可導性』︰假使一個函數從『一階到 N 階』的導數都『存在』而且『連續』,我們稱之為 C^{N} 類函數。舉例來說

f(x) = \begin{cases}x^2\sin{(\tfrac{1}{x})} & \mbox{if }x \neq 0, \\ 0 &\mbox{if }x = 0\end{cases}

雖然『一階導數』存在但是在 x = 0 時,並不『連續』,所以它只能屬於 C^{0} 類,而不是屬於 C^{1} 類。

雖然一個『光滑函數』就屬於 C^{\infty} 類,但是它可以不是『解析函數』,比方說

f(x) = \begin{cases}e^{-\frac{1}{1-x^2}} & \mbox{ if } |x| < 1, \\ 0 &\mbox{ otherwise }\end{cases}

是『光滑的』,然而在 x = \pm 1 時無法用『泰勒級數』來作展開,因此不是『解析的』。

縱使人們覺得『連續』與『鄰近』以及『導數』和『光滑』彼此之間有聯繫,由於失去『直觀』的導引,『概念』卻又越來越『複雜』,因此『微積分』也就遠離了一般人的『理解』,彷彿鎖在『解析』與『極限』的『巴別塔』中!更不要說還有一些『很有用』卻是『很奇怪』的函數。舉例來說,『單位階躍』函數,又稱作『黑維塞階躍函數』 Heaviside step function ,可以定義如下

H(x) = \begin{cases} 0, & x < 0 \\ \frac{1}{2}, & x = 0 \\ 1, & x > 0 \end{cases}

,在 x = 0 時是『不連續』的,它可以『解析』為

H(x)=\lim \limits_{k \rightarrow \infty}\frac{1}{2}(1+\tanh kx)=\lim \limits_{k \rightarrow \infty}\frac{1}{1+\mathrm{e}^{-2kx}}

,它的『微分』是 \frac{dH(x)}{dx} = \delta(x),而且這個『狄拉克 \delta(x) 函數』 Dirac Delta function 是這樣定義的

\delta(x) = \begin{cases} +\infty, & x = 0 \\ 0, & x \ne 0 \end{cases}

,滿足

\int_{-\infty}^\infty \delta(x) \, dx = 1

。怕是想『解析』一下都令人頭大,『極限』和『微分』與『積分』能不能『交換次序』,它必須滿足『什麼條件』,假使再加上『無限級數』求和,

\operatorname{III}_T(t) \ \stackrel{\mathrm{def}}{=}\ \sum_{k=-\infty}^{\infty} \delta(t - k T) = \frac{1}{T}\operatorname{III}\left(\frac{t}{T}\right)

,果真是我的天啊的吧!!

【插圖】馬克杯與環面互變之同胚、莫比烏斯帶、三葉結、二維 bump 函數、C^0 類函數 f(x)(即上文之分段線性例)、具兩個聚點之有理數列、對角線論證、x^2 \sin(\tfrac{1}{x}) 及其導數

f'(x) = \begin{cases}-\mathord{\cos(\tfrac{1}{x})} + 2x\sin(\tfrac{1}{x}) & \mbox{if }x \neq 0, \\ 0 &\mbox{if }x = 0.\end{cases}

、mollifier 函數、非解析之光滑函數

f(x) := \begin{cases}e^{-\frac{1}{x}} & x > 0, \\ 0 & x \leq 0 \end{cases}

、階躍函數例、單位階躍函數 H(x)、狄拉克 δ 函數(單位脈衝函數)及其高斯近似

\delta_{a}(x) = \frac{1}{a \sqrt{\pi}} e^{- x^2 / a^2} , \quad a \rightarrow 0

、狄拉克梳 \operatorname{III}_T(t) 等插圖。

一九六零年,德國數學家『亞伯拉罕‧魯濱遜』 Abraham Robinson 將『萊布尼茲』的微分直觀落實。 用嚴謹的方法來定義和運算實數的『無窮小』與『無限大』,這就是數學史上著名的『非標準微積分』Non-standard calculus ,可說是『非標準分析』non-standard analysis 之父。

就像『複數』C 是『實數系』R 的『擴張』一樣,他將『實數系』增入了『無窮小』 infinitesimals 元素 \delta x ,魯濱遜創造出『超實數』 hyperreals r^{*} = r + \delta x,形成了『超實數系』R^{*}。那這個『無窮小』是什麼樣的『數』呢?對於『正無窮小』來說,任何給定的『正數』都要比它大,就『負無窮小』來講,它大於任何給定的『負數』。 『零』也就自然的被看成『實數系』裡的『無窮小』的了。假使我們說兩個超實數 a, b, \ a \neq b 是『無限的鄰近』 infinitely close,記作 a \approx b 是指 b -a \approx 0 是個『無窮小』量。在這個觀點下,『無窮小』量不滿足『實數』的『阿基米德性質』。也就是說,對於任意給定的 m 來講, m \cdot \delta x 為『無窮小』量;而 \frac{1}{\delta x} 是『無限大』量。然而在『系統』與『自然』的『擴張』下,『超實數』的『算術』符合所有一般『代數法則』。

hyperreals

Standard_part_function_with_two_continua.svg

220px-Sentido_geometrico_del_diferencial_de_una_funcion

速度里程表

有人把『超實數』想像成『數原子』,一個環繞著『無窮小』數的『實數』。就像『複數』有『實部』R_e 與『虛部』I_m 取值『運算』一樣,『超實數』也有一個取值『運算』叫做『標準部份函數』Standard part function

st(r^{*}) = st(r + \delta x)
= st(r) + st(\delta x) = r + 0 = r

。 如此一個『函數』f(x) 在 x_0 是『連續的』就可以表示成『如果 x \approx x_0, \ x \neq x_0,可以得到 f(x) \approx f(x_0)』。

假使 y = x^2,那麼 y 的『斜率』就可以這麼計算

\frac{dy}{dx} = st \left[ \frac{\Delta y}{\Delta x} \right] = st \left[ \frac{(x + \Delta x)^2 - x^2}{\Delta x} \right]
= st \left[2 x + \Delta x \right] = 2 x

。 彷彿在用著可以調整『放大倍率』的『顯微鏡』逐步『觀入』 zoom in 一個『函數』,隨著『解析度』的提高,函數之『曲率』逐漸減小,越來越『逼近』一條『直線』── 某點的切線 ── 的啊!!

同樣的『積分』就像是『里程表』的『累計』一樣,可以用

\forall 0 < \delta x \approx 0, \ \int_{a}^{b} f(x) \, dx \approx f(a)\delta x + f(a + \delta x)\delta x + \cdots + f(b - \delta x)\delta x

來表示的呀!!

─── 摘自《【Sonic π】電路學之補充《四》無窮小算術‧中
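【筆者附註】上文『里程累計』之積分觀,可取一個很小(但有限)的 \delta x 用程式體會(以 f(x) = x^2 在 [0, 1] 上為例,筆者之示意):

def riemann_sum(f, a, b, dx=1e-5):
    # 模仿 f(a)·δx + f(a+δx)·δx + ... + f(b-δx)·δx 之累加
    total, x = 0.0, a
    while x < b:
        total += f(x) * dx
        x += dx
    return total

print(riemann_sum(lambda x: x * x, 0.0, 1.0))   # ≈ 1/3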

 

如是 S 神經元或可這樣理解

\Delta \sigma(z) = \frac{\Delta \sigma(z)}{\Delta z} \cdot \Delta z \approx \sigma(z)(1 - \sigma (z) ) \cdot \Delta z

由於 z(w, b) = w \cdot x + b ,因此

\Delta z = \Delta w \cdot x + \Delta b

所以可得到

\Delta \sigma(z) \approx \sigma(z)(1 - \sigma (z) ) \cdot ( \Delta w \cdot x + \Delta b)

= \sigma(z)(1 - \sigma (z) ) \cdot x \cdot \Delta w + \sigma(z)(1 - \sigma (z) ) \cdot \Delta b

的耶!!!
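【筆者附註】最後這個近似式,不妨數值驗證一下(取單一輸入之 S 神經元,數字皆為筆者隨意假設):

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x = 0.8, -0.3, 1.5
dw, db = 1e-3, -2e-3                       # 小改變 Δw 與 Δb

z = w * x + b
exact = sigma((w + dw) * x + (b + db)) - sigma(z)
approx = sigma(z) * (1 - sigma(z)) * (dw * x + db)
print(exact, approx)                       # 兩者應幾乎相等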