W!o+'s 《小伶鼬工坊演義》: Neural Networks [Words of a Hundred Schools], Part 5

Following along as Mr. Michael Nielsen tells the story

On stories in neural networks

Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong […] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

Question and answer with neural networks researcher Yann LeCun

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with “I’m very sympathetic to your point of view, but […]”. Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I’d rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn’t always have unarguable new progress to report.

You may have noticed a verbal tic similar to “I’m very sympathetic […]” in the current book. To explain what we’re seeing I’ve often fallen back on saying “Heuristically, […]”, or “Roughly speaking, […]”, following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I’ve presented has often been pretty thin. If you look through the research literature you’ll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?

In many parts of science – especially those parts that deal with simple phenomena – it’s possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it’s exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.

One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works*

*From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).

: “This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.” This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we’ve discussed. Each heuristic is not just a (potential) explanation, it’s also a challenge to investigate and understand in more detail.

Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It’s going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It’s like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly – as the explorers understood geography, and as we understand neural nets today – it’s more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.

───

 

This chapter, too, is drawing to a close. In an emerging technical field a hundred schools of thought usually contend, yet deep theoretical cultivation always seems to fall short; a mature science, by contrast, struggles to break new ground and waits for the flint-spark of inspiration! Is this not the Zen debate of "the withered and the flourishing"? Still, whichever side one stands on, can one not draw on the "experience" of history? Just as with "computational physics":

Computational physics

Computational physics is the study and implementation of numerical analysis to solve problems in physics for which a quantitative theory already exists.[1] Historically, computational physics was the first application of modern computers in science, and is now a subset of computational science.

It is sometimes regarded as a subdiscipline (or offshoot) of theoretical physics, but others consider it an intermediate branch between theoretical and experimental physics, a third way that supplements theory and experiment.[2]

[Figure: Computational_physics_diagram.svg]

A representation of the multidisciplinary nature of computational physics both as an overlap of physics, applied mathematics, and computer science and as a bridge among them.[3]

Challenges in computational physics

Physics problems are in general very difficult to solve exactly. This is due to several (mathematical) reasons: lack of algebraic and/or analytic solubility, complexity, and chaos.

For example, even apparently simple problems, such as calculating the wavefunction of an electron orbiting an atom in a strong electric field (Stark effect), may require great effort to formulate a practical algorithm (if one can be found); other cruder or brute-force techniques, such as graphical methods or root finding, may be required. On the more advanced side, mathematical perturbation theory is also sometimes used (a working is shown for this particular example here).

In addition, the computational cost and computational complexity for many-body problems (and their classical counterparts) tend to grow quickly. A macroscopic system typically has a size of the order of  10^{23} constituent particles, so it is somewhat of a problem. Solving quantum mechanical problems is generally of exponential order in the size of the system[citation needed] and for classical N-body it is of order N-squared.

Finally, many physical systems are inherently nonlinear at best, and at worst chaotic: this means it can be difficult to ensure any numerical errors do not grow to the point of rendering the ‘solution’ useless.[5]

───

 

we have already had a first taste of its inner flavor!!

One must pay attention to the "well-posedness" of a mathematical model??

Well-posed problem

The mathematical term well-posed problem stems from a definition given by Jacques Hadamard. He believed that mathematical models of physical phenomena should have the properties that

  1. A solution exists
  2. The solution is unique
  3. The solution’s behavior changes continuously with the initial conditions.

Examples of archetypal well-posed problems include the Dirichlet problem for Laplace’s equation, and the heat equation with specified initial conditions. These might be regarded as ‘natural’ problems in that there are physical processes modelled by these problems.

Problems that are not well-posed in the sense of Hadamard are termed ill-posed. Inverse problems are often ill-posed. For example, the inverse heat equation, deducing a previous distribution of temperature from final data, is not well-posed in that the solution is highly sensitive to changes in the final data.

Continuum models must often be discretized in order to obtain a numerical solution. While solutions may be continuous with respect to the initial conditions, they may suffer from numerical instability when solved with finite precision, or with errors in the data. Even if a problem is well-posed, it may still be ill-conditioned, meaning that a small error in the initial data can result in much larger errors in the answers. An ill-conditioned problem is indicated by a large condition number.

If the problem is well-posed, then it stands a good chance of solution on a computer using a stable algorithm. If it is not well-posed, it needs to be re-formulated for numerical treatment. Typically this involves including additional assumptions, such as smoothness of solution. This process is known as regularization. Tikhonov regularization is one of the most commonly used for regularization of linear ill-posed problems.
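To make ill-conditioning concrete, here is a minimal NumPy sketch (the 2×2 matrix is a toy example of my own choosing, not taken from the article): a tiny perturbation of the data b produces a wildly different solution, and the large condition number signals the danger in advance.

```python
import numpy as np

# A nearly singular (ill-conditioned) 2x2 system: tiny changes in b
# produce large changes in the solution x.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])

x = np.linalg.solve(A, b)                                      # solution for this b
x_perturbed = np.linalg.solve(A, b + np.array([0.0, 1e-4]))    # tiny error in the data

print("condition number:", np.linalg.cond(A))   # large => ill-conditioned
print("x          :", x)                        # [1, 1]
print("x_perturbed:", x_perturbed)              # roughly [0, 2], wildly different
```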

───

 

Perhaps one can now see why "regularization" becomes unavoidable!!!

Tikhonov regularization

Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

Suppose that for a known matrix A and vector b, we wish to find a vector x such that

  Ax = b.

The standard approach is ordinary least squares linear regression. However, if no x satisfies the equation or more than one x does—that is, the solution is not unique—the problem is said not to be well posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted) system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where A maps x to b. Therefore, in solving the inverse problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of x that is in the null-space of A, rather than allowing for a model to be used as a prior for x. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as

  \|Ax - b\|^{2}

where \|\cdot\| is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:

  \|Ax - b\|^{2} + \|\Gamma x\|^{2}

for some suitably chosen Tikhonov matrix \Gamma. In many cases, this matrix is chosen as a multiple of the identity matrix (\Gamma = \alpha I), giving preference to solutions with smaller norms; this is known as L2 regularization.[1] In other cases, lowpass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by \hat{x}, is given by:

\hat{x} = (A^T A + \Gamma^T \Gamma)^{-1} A^T b

The effect of regularization may be varied via the scale of matrix \Gamma. For \Gamma = 0 this reduces to the unregularized least squares solution provided that (A^T A)^{-1} exists.

L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines,[2] and matrix factorization.[3]
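As a concrete illustration of the explicit solution above, here is a minimal NumPy sketch of Tikhonov/ridge regression for the special case \Gamma = \alpha I; the underdetermined toy system is my own example, chosen only to show that the regularized problem has a unique answer even though the original one does not.

```python
import numpy as np

def tikhonov_solve(A, b, alpha):
    """Ridge / Tikhonov solution x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b,
    here with Gamma = alpha * I."""
    n = A.shape[1]
    G = alpha * np.eye(n)                          # Tikhonov matrix Gamma
    return np.linalg.solve(A.T @ A + G.T @ G, A.T @ b)

# An underdetermined toy problem: 2 equations, 3 unknowns (A^T A is singular).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([1.0, 2.0])

for alpha in (0.01, 0.1, 1.0):
    x_hat = tikhonov_solve(A, b, alpha)
    print(alpha, x_hat, "residual:", np.linalg.norm(A @ x_hat - b))
```

Larger values of alpha shrink the solution norm at the cost of a larger residual, which is exactly the trade-off the regularization term expresses.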

……

Bayesian interpretation

Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix \Gamma seems rather arbitrary, the process can be justified from a Bayesian point of view. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a unique solution. Statistically, the prior probability distribution of x is sometimes taken to be a multivariate normal distribution. For simplicity here, the following assumptions are made: the means are zero; their components are independent; the components have the same standard deviation \sigma_x. The data are also subject to errors, and the errors in b are also assumed to be independent with zero mean and standard deviation \sigma_b. Under these assumptions the Tikhonov-regularized solution is the most probable solution given the data and the a priori distribution of x, according to Bayes' theorem.[4]

If the assumption of normality is replaced by assumptions of homoskedasticity and uncorrelatedness of errors, and if one still assumes zero mean, then the Gauss–Markov theorem entails that the solution is the minimal unbiased estimator.[citation needed]

───


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Words of a Hundred Schools], Part 4

Before "groping for a black cat in a dark room", one had best make sure the cat actually "exists"!

Existence theorem

In mathematics, an existence theorem is the general name for any theorem whose statement begins with "there exists …". Sometimes a qualifier is added in front, for example "for all …, there exists …". Formally, an existence theorem is a theorem whose statement involves the existential quantifier. In practice, many existence theorems do not explicitly use the word "exists"; for instance, the theorem "the sine function is continuous" contains no such word, yet it is still an existence theorem, because the definition of "continuity" is itself an existential definition.

In the early twentieth century there was controversy over purely existential theorems. From the standpoint of constructivist mathematics, admitting such theorems reduces the practical usefulness of mathematics. The opposing view holds that abstract methods can achieve goals that numerical analysis cannot.

……

‘Pure’ existence results

An existence theorem is purely theoretical if the proof given of it doesn’t also indicate a construction of whatever kind of object the existence of which is asserted. Such a proof is non-constructive, and the point is that the whole approach may not lend itself to construction.[2] In terms of algorithms, purely theoretical existence theorems bypass all algorithms for finding what is asserted to exist. They contrast with “constructive” existence theorems.[3] Many constructivist mathematicians work in extended logics (such as intuitionistic logic) where such existence statements are intrinsically weaker than their constructive counterparts.

Such purely theoretical existence results are in any case ubiquitous in contemporary mathematics. For example, John Nash‘s original proof of the existence of a Nash equilibrium, from 1951, was such an existence theorem. In 1962 a constructive approach was found.[4]

───

 

Hence, anyone who wants to carry out an "optimization" computation can hardly avoid examining the "well-posedness" of the solution, can they?

Optimization

Optimization is a branch of applied mathematics that mainly studies problems of the following form:

Mathematical formulation

It mainly studies problems of the following form:

Given a function f: A \to \mathbb{R}, find an element x^0 \in A such that f(x^0) \leq f(x) for all x in A (minimization), or f(x^0) \geq f(x) for all x in A (maximization).

Such formulations are sometimes also called "mathematical programming" (for example, linear programming). Many practical and theoretical problems can be modeled within this general framework.

Typically, A is a subset of Euclidean space \mathbb{R}^n, often specified by a set of constraint equalities or inequalities that the elements of A must satisfy. The elements of A are called feasible solutions. The function f is called the objective function, or cost function. A feasible solution that minimizes (or maximizes) the objective function is called an optimal solution.

In general there may exist several local minima or maxima. A local minimum x^* is defined as a point such that for some \delta > 0 and for all x satisfying

\|x - x^*\| \leq \delta,

the inequality

f(x^*) \leq f(x)

holds. That is, on some closed ball around x^*, every function value is greater than or equal to the value at that point. In general, finding a local minimum is relatively easy, but guaranteeing that it is a global minimum requires additional conditions, for example that the function be convex.

Notation

Optimization problems are often written using special notation. For example:

  \min_{x \in \mathbb{R}} \; x^2 + 1

This asks for the minimum value of the expression x^2 + 1, where x ranges over the real numbers \mathbb{R}. The minimum value is 1, attained at x = 0.

  \max_{x \in \mathbb{R}} \; 2x

This asks for the maximum value of the expression 2x, again with x ranging over all real numbers. Since the expression is not bounded above, no maximum exists; the answer is infinity, or undefined.

  \operatorname{argmin}_{x \in [-\infty, -1]} \; x^2 + 1

This asks for the value of x at which the expression x^2 + 1 attains its minimum. Here x is restricted to the interval [-\infty, -1], so the answer is x = -1.
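As a quick numerical illustration of the min and argmin notation above, here is a crude grid-evaluation sketch in NumPy (the grids and ranges are arbitrary choices of mine, used only as a sanity check):

```python
import numpy as np

f = lambda x: x**2 + 1

# min_{x in R} x^2 + 1: evaluate on a dense grid as a rough check.
xs = np.linspace(-5.0, 5.0, 100001)
vals = f(xs)
i = np.argmin(vals)
print("minimum value ~", vals[i], "attained near x =", xs[i])   # expect 1 at x = 0

# argmin_{x in [-inf, -1]} x^2 + 1: the constraint is active, so the
# minimizer sits at the boundary x = -1.
xs_c = np.linspace(-5.0, -1.0, 40001)
print("constrained argmin ~", xs_c[np.argmin(f(xs_c))])         # expect -1
```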

───

 

This is why Mr. Michael Nielsen here adopts

the spirit of "to say you know when you know, and to say you do not know when you do not, that is knowledge"

in his broad survey of "other" models of artificial neurons:

Other models of artificial neuron

Up to now we’ve built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.

Perhaps the simplest variation is the tanh (pronounced “tanch”) neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input x, weight vector w, and bias b is given by

\tanh(w \cdot x+b),  \ \ \ \ (109)

where tanh is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the tanh function is defined by

\tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}. \ \ \ \ (110)

With a little algebra it can easily be verified that

\sigma(z) = \frac{1+\tanh(z/2)}{2}, \ \ \ \ (111)

that is, tanh is just a rescaled version of the sigmoid function. We can also see graphically that the tanh function has the same shape as the sigmoid function,

[Figure: graph of the tanh function]

One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1. This means that if you’re going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.

Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function*

*There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it’s usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy.

mapping inputs to the range -1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.

Exercise

  • Prove the identity in Equation (111). (A quick numerical sanity check is sketched below.)
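As a complement to the exercise, here is a quick numerical check of Equation (111) in NumPy; this is my own throwaway script, not part of the book's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check sigma(z) = (1 + tanh(z/2)) / 2 over a range of inputs.
z = np.linspace(-10.0, 10.0, 2001)
lhs = sigmoid(z)
rhs = (1.0 + np.tanh(z / 2.0)) / 2.0
print("max |lhs - rhs| =", np.max(np.abs(lhs - rhs)))   # ~1e-16: the two sides agree
```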

Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better*

*See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).

. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we’re using sigmoid neurons, so all activations in our network are positive. Let’s consider the weights w^{l+1}_{jk} input to the jth neuron in the (l+1)th layer. The rules for backpropagation (see here) tell us that the associated gradient will be a^l_k \delta^{l+1}_j. Because the activations are positive the sign of this gradient will be the same as the sign of \delta^{l+1}_j. What this means is that if \delta^{l+1}_j is positive then all the weights w^{l+1}_{jk} will decrease during gradient descent, while if \delta^{l+1}_j is negative then all the weights w^{l+1}_{jk} will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That’s a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as tanh, which allows both positive and negative activations. Indeed, because tanh is symmetric about zero, \tanh(-z) = -\tanh(z), we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.

How seriously should we take this argument? While the argument is suggestive, it’s a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don’t yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.

Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input x, weight vector w, and bias b is given by

\max(0, w \cdot x+b). \ \ \ \ (112)

Graphically, the rectifying function \max(0, z) looks like this:

[Figure: graph of the rectifying function max(0, z)]

Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.

When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition*

*See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I’ve glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks.

has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either 0 or 1. As we’ve seen repeatedly in this chapter, the problem is that \sigma' terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.
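To see the saturation point numerically, here is a small sketch of my own devising that compares the derivatives of the sigmoid, tanh, and rectifying functions at a few inputs: the sigmoid and tanh derivatives collapse toward zero for large |z|, while the ReLU derivative stays at 1 for any positive input and is 0 for negative input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):            # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):               # tanh'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):               # derivative of max(0, z), taking 0 at z <= 0
    return 1.0 if z > 0 else 0.0

# Saturation: for large |z| the sigmoid and tanh derivatives vanish,
# while the ReLU derivative is 1 for any positive z.
for z in (-20.0, -2.0, 0.5, 2.0, 20.0):
    print(f"z = {z:6.1f}   sigmoid' = {d_sigmoid(z):.2e}   "
          f"tanh' = {d_tanh(z):.2e}   relu' = {d_relu(z):.0f}")
```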

I’ve painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we’d have a theory which tells us, in detail, how to choose (and perhaps modify-on-the-fly) our activation functions. On the other hand, we shouldn’t let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I’ll continue to use sigmoid neurons as our go-to neuron, since they’re powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.

───

 

May we place our hopes in the future??!!

Artificial neuron

An artificial neuron is a mathematical function conceived as a model of biological neurons. Artificial neurons are the constitutive units in an artificial neural network. Depending on the specific model used they may be called a semi-linear unit, Nv neuron, binary neuron, linear threshold function, or McCulloch–Pitts (MCP) neuron. The artificial neuron receives one or more inputs (representing dendrites) and sums them to produce an output (representing a neuron’s axon). Usually the sums of each node are weighted, and the sum is passed through a non-linear function known as an activation function or transfer function. The transfer functions usually have a sigmoid shape, but they may also take the form of other non-linear functions, piecewise linear functions, or step functions. They are also often monotonically increasing, continuous, differentiable and bounded. The thresholding function has inspired the construction of logic gates referred to as threshold logic, with renewed interest in building logic circuits that resemble brain processing. For example, new devices such as memristors have been used extensively in recent times to develop such logic.[1]

The artificial neuron transfer function should not be confused with a linear system’s transfer function.

……

Basic structure

For a given artificial neuron, let there be m + 1 inputs with signals x_0 through x_m and weights w_{k0} through w_{km}. Usually, the x_0 input is assigned the value +1, which makes it a bias input with w_{k0} = b_k. This leaves only m actual inputs to the neuron: from x_1 to x_m.

The output of the kth neuron is:

y_k = \varphi\left(\sum_{j=0}^{m} w_{kj} x_j\right)

where \varphi (phi) is the transfer function.

[Figure: Artificial neuron.png]

The output is analogous to the axon of a biological neuron, and its value propagates to the input of the next layer, through a synapse. It may also exit the system, possibly as part of an output vector.

It has no learning process as such: its transfer-function weights are calculated and its threshold value is predetermined.
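A minimal NumPy sketch of the formula above, with the bias handled through the conventional x_0 = +1 input; the particular numbers and the choice of tanh as the transfer function \varphi are mine, purely for illustration.

```python
import numpy as np

def neuron_output(x, w, phi=np.tanh):
    """Output of a single artificial neuron: y = phi(sum_j w_j * x_j).
    Here x[0] is the conventional bias input fixed to +1, so w[0] plays
    the role of the bias b."""
    return phi(np.dot(w, x))

x = np.array([1.0, 0.5, -0.3, 0.8])      # x[0] = +1 is the bias input
w = np.array([0.1, 0.4, -0.2, 0.7])      # w[0] acts as the bias b
print(neuron_output(x, w))                               # tanh transfer function
print(neuron_output(x, w, phi=lambda z: max(0.0, z)))    # a rectifying transfer function
```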

Comparison to biological neurons

Artificial neurons are designed to mimic aspects of their biological counterparts.

  • Dendrites – In a biological neuron, the dendrites act as the input vector. These dendrites allow the cell to receive signals from a large (>1000) number of neighboring neurons. As in the above mathematical treatment, each dendrite is able to perform “multiplication” by that dendrite’s “weight value.” The multiplication is accomplished by increasing or decreasing the ratio of synaptic neurotransmitters to signal chemicals introduced into the dendrite in response to the synaptic neurotransmitter. A negative multiplication effect can be achieved by transmitting signal inhibitors (i.e. oppositely charged ions) along the dendrite in response to the reception of synaptic neurotransmitters.
  • Soma – In a biological neuron, the soma acts as the summation function, seen in the above mathematical description. As positive and negative signals (exciting and inhibiting, respectively) arrive in the soma from the dendrites, the positive and negative ions are effectively added in summation, by simple virtue of being mixed together in the solution inside the cell’s body.
  • Axon – The axon gets its signal from the summation behavior which occurs inside the soma. The opening to the axon essentially samples the electrical potential of the solution inside the soma. Once the soma reaches a certain potential, the axon will transmit an all-or-none signal pulse down its length. In this regard, the axon provides the means by which we connect our artificial neuron to other artificial neurons.

Unlike most artificial neurons, however, biological neurons fire in discrete pulses. Each time the electrical potential inside the soma reaches a certain threshold, a pulse is transmitted down the axon. This pulsing can be translated into continuous values. The rate (activations per second, etc.) at which an axon fires converts directly into the rate at which neighboring cells get signal ions introduced into them. The faster a biological neuron fires, the faster nearby neurons accumulate electrical potential (or lose electrical potential, depending on the “weighting” of the dendrite that connects to the neuron that fired). It is this conversion that allows computer scientists and mathematicians to simulate biological neural networks using artificial neurons which can output distinct values (often from −1 to 1).

───


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Words of a Hundred Schools], Part 3

When Qin lost its deer, all under heaven gave chase. An important algorithm usually spawns many variations. Perhaps this passage of Mr. Michael Nielsen is written for those chasing the deer across the Central Plains??

Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of “velocity” for the parameters we’re trying to optimize. The gradient acts to change the velocity, not (directly) the “position”, in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.

Let’s give a more precise mathematical description. We introduce velocity variables v = v_1, v_2, \ldots, one for each corresponding w_j variable*

*In a neural net the w_j variables would, of course, include all weights and biases.

. Then we replace the gradient descent update rule w \rightarrow w'= w-\eta \nabla C by

v & \rightarrow & v' = \mu v - \eta \nabla C \ \ \ \ (107)
w & \rightarrow & w' = w+v'. \ \ \ \ (108)

In these equations, \mu is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it’s helpful to first consider the case where \mu = 1, which corresponds to no friction. When that’s the case, inspection of the equations shows that the “force” \nabla C is now modifying the velocity, v, and the velocity is controlling the rate of change of w. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we’re moving straight down a slope:

With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That’s the reason for the \mu hyper-parameter in (107). I said earlier that \mu controls the amount of friction in the system; to be a little more precise, you should think of 1 - \mu as the amount of friction in the system. When \mu = 1, as we’ve seen, there is no friction, and the velocity is completely driven by the gradient \nabla C. By contrast, when \mu = 0 there’s a lot of friction, the velocity can’t build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, w \rightarrow w'=w-\eta \nabla C. In practice, using a value of \mu intermediate between 0 and 1 can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for \mu using the held-out validation data, in much the same way as we select \eta and \lambda.

I’ve avoided naming the hyper-parameter \mu up to now. The reason is that the standard name for \mu is badly chosen: it’s called the momentum co-efficient. This is potentially confusing, since \mu is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.

A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it’s done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
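Here is a minimal sketch of the update rule in Equations (107) and (108) applied to a toy quadratic cost; the cost function, learning rate, and momentum coefficient are arbitrary choices of mine, meant only to show the shape of the code.

```python
import numpy as np

def grad_C(w):
    # Gradient of a simple quadratic "valley": C(w) = 0.5 * (w1^2 + 10 * w2^2)
    return np.array([1.0, 10.0]) * w

w = np.array([5.0, 5.0])
v = np.zeros_like(w)          # velocity, one component per parameter
eta, mu = 0.05, 0.9           # learning rate and momentum coefficient

for step in range(200):
    v = mu * v - eta * grad_C(w)     # Equation (107)
    w = w + v                        # Equation (108)

print(w)    # ends up close to the minimum at the origin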

Exercise

  • What would go wrong if we used \mu > 1 in the momentum technique?
  • What would go wrong if we used \mu < 0 in the momentum technique?

 

If \mu > 1, then \mu^k \to \infty as k \to \infty, so the velocity may grow without bound and will no longer be restrained by the "force" \nabla C. If \mu < 0, then \mu^{2k} is positive while \mu^{2k+1} is negative, so the updates are likely to oscillate.

Problem

  • Add momentum-based stochastic gradient descent to network2.py. (One possible sketch is given below.)
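One possible sketch of the solution, assuming the update_mini_batch structure of network2.py as presented earlier in the book (self.weights, self.biases, and a backprop method returning per-layer gradients); this is my own outline, not the book's reference implementation.

```python
# Sketch of momentum-based SGD for network2.py (numpy is already imported as np there).
# Velocities would be initialized once, e.g. in __init__ or default_weight_initializer:
#
#     self.vel_w = [np.zeros(w.shape) for w in self.weights]
#     self.vel_b = [np.zeros(b.shape) for b in self.biases]

def update_mini_batch(self, mini_batch, eta, lmbda, n, mu=0.9):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # Momentum update: v -> mu*v - eta*grad, then w -> w + v (Equations 107-108),
    # keeping the L2 weight-decay factor used elsewhere in network2.py.
    self.vel_w = [mu * vw - (eta / len(mini_batch)) * nw
                  for vw, nw in zip(self.vel_w, nabla_w)]
    self.weights = [(1 - eta * (lmbda / n)) * w + vw
                    for w, vw in zip(self.weights, self.vel_w)]
    self.vel_b = [mu * vb - (eta / len(mini_batch)) * nb
                  for vb, nb in zip(self.vel_b, nabla_b)]
    self.biases = [b + vb for b, vb in zip(self.biases, self.vel_b)]
```

Setting mu = 0 recovers the original update rule, which is a handy sanity check.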

Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn’t universal agreement on which is the best approach. As you go deeper into neural networks it’s worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier*

*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results*

*See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).

is Nesterov’s accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we’ll stick to stochastic gradient descent through the remainder of this book.

───

 

One must also pay attention to differences in notation and in styles of explanation among the various authors, so as not to create needless confusion!!

Stochastic gradient descent

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. In other words, SGD tries to find minima or maxima by iteration.

……

Momentum

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams‘ seminal paper on backpropagation learning.[10] Stochastic gradient descent with momentum remembers the update \Delta w at each iteration, and determines the next update as a convex combination of the gradient and the previous update:[11]

\Delta w := \eta \nabla Q_i(w) + \alpha \Delta w
  w := w - \Delta w

or as a mathematically equivalent formulation:[12]

\Delta w := -\eta \nabla Q_i(w) + \alpha \Delta w
w := w + \Delta w

that leads to:

w := w - \eta \nabla Q_i(w) + \alpha \Delta w

where the parameter w which minimizes Q(w) is to be estimated, and \eta is a step size (sometimes called the learning rate in machine learning).

The name momentum stems from an analogy to momentum in physics: the weight vector, thought of as a particle traveling through parameter space,[10] incurs acceleration from the gradient of the loss (“force“). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.[13]

───


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Words of a Hundred Schools], Part 2

The affairs of the world are driven by the prevailing currents, and so great minds think alike. Mr. Michael Nielsen writes:

Variations on stochastic gradient descent

Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.

Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we’re just going to consider the abstract problem of minimizing a cost function C which is a function of many variables, w = w_1, w_2, \ldots, so C = C(w). By Taylor’s theorem, the cost function can be approximated near a point w by

C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \ \ \ \ (103)

We can rewrite this more compactly as

C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \ \ \ \ (104)

where \nabla C is the usual gradient vector, and H is a matrix known as the Hessian matrix, whose jkth entry is \partial^2 C / \partial w_j \partial w_k. Suppose we approximate C by discarding the higher-order terms represented by \ldots above,

C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \ \ \ \ (105)

Using calculus we can show that the expression on the right-hand side can be minimized*

*Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function C looks like a valley locally, not a mountain or a saddle.

by choosing

\Delta w = -H^{-1} \nabla C. \ \ \ \ (106)

Provided (105) is a good approximate expression for the cost function, then we’d expect that moving from the point w to w+\Delta w = w-H^{-1} \nabla C should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost:

  • Choose a starting point, w.
  • Update w to a new point w' = w-H^{-1} \nabla C, where the Hessian H and \nabla C are computed at w.
  • Update w' to a new point w'' = w'-H'^{-1} \nabla' C, where the Hessian H' and \nabla' C are computed at w'.
  • \ldots

In practice, (105) is only an approximation, and it’s better to take smaller steps. We do this by repeatedly changing w by an amount \Delta w = -\eta H^{-1} \nabla C, where \eta is known as the learning rate.
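A minimal numerical sketch of the damped update \Delta w = -\eta H^{-1} \nabla C on a toy quadratic cost (the matrix and starting point are my own choices); note that in practice one solves the linear system H \Delta w = -\eta \nabla C rather than forming H^{-1} explicitly.

```python
import numpy as np

# Toy cost: C(w) = 0.5 * w^T A w - b^T w, so grad C = A w - b and H = A (constant here).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # positive definite, so we really have a minimum
b = np.array([1.0, -1.0])

def grad_C(w):
    return A @ w - b

def hessian_C(w):
    return A

w = np.array([4.0, -3.0])
eta = 1.0                        # eta = 1 recovers the pure Newton/Hessian step
for _ in range(5):
    H = hessian_C(w)
    delta_w = -eta * np.linalg.solve(H, grad_C(w))   # Delta w = -eta * H^{-1} grad C
    w = w + delta_w

print(w, "vs exact minimizer", np.linalg.solve(A, b))
```

For a quadratic cost a single full step lands exactly on the minimum, which is why Hessian methods can converge in far fewer steps than plain gradient descent.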

This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it’s possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.

If Hessian optimization is so great, why aren’t we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it’s very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with 10^7 weights and biases. Then the corresponding Hessian matrix will contain 10^7 \times 10^7 = 10^{14} entries. That’s a lot of entries! And that makes computing H^{-1} \nabla C extremely difficult in practice. However, that doesn’t mean that it’s not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let’s take a look at one such technique, momentum-based gradient descent.

───

 

Compare this with what the Stanford CS course notes say:

cs231n.github.io/neural-networks-3

Second order methods

A second, popular group of methods for optimization in context of deep learning is based on Newton’s method, which iterates the following update:

x \leftarrow x - [H f(x)]^{-1} \nabla f(x)

Here, H f(x) is the Hessian matrix, which is a square matrix of second-order partial derivatives of the function. The term \nabla f(x) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite as a large advantage over first-order methods.

However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of quasi-Newton methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is L-BFGS, which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).

However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.

In practice, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.
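For a quick hands-on feel for a quasi-Newton method, SciPy's L-BFGS-B implementation can be tried on a small test function; the following sketch assumes scipy is installed, and the Rosenbrock function and starting point are my own choices, unrelated to the course notes.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize the Rosenbrock function with L-BFGS-B as a stand-in for the
# quasi-Newton methods discussed above (used here purely for illustration).
def rosenbrock(w):
    return (1 - w[0])**2 + 100.0 * (w[1] - w[0]**2)**2

def rosenbrock_grad(w):
    return np.array([
        -2.0 * (1 - w[0]) - 400.0 * w[0] * (w[1] - w[0]**2),
        200.0 * (w[1] - w[0]**2),
    ])

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]),
                  jac=rosenbrock_grad, method="L-BFGS-B")
print(result.x)        # approaches the minimum at (1, 1)
```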

Additional references:

  • Large Scale Distributed Deep Networks is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
  • SFO algorithm strives to combine the advantages of SGD with advantages of L-BFGS.

……

[Reference pages]

cs231n.github.io

cs231n.github.io/neural-networks-1

cs231n.github.io/neural-networks-2

cs231n.github.io/neural-networks-3

───

 

The two read almost word for word alike; surely this is no coincidence!!

One should also appreciate that this Newton's method has quite a long pedigree??

Newton’s method in optimization

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function f (i.e. solutions to the equation f(x) = 0). In optimization, Newton's method is applied to the derivative f′ of a twice-differentiable function f to find the roots of the derivative (solutions to f′(x) = 0), also known as the stationary points of f.

Method

In the one-dimensional problem, Newton's method attempts to construct a sequence x_n from an initial guess x_0 that converges towards some value x^* satisfying f'(x^*) = 0. This x^* is a stationary point of f.

The second-order Taylor expansion f_T(x) of f around x_n is:

f_T(x_n + \Delta x) = f(x_n) + f'(x_n)\,\Delta x + \frac{1}{2} f''(x_n)\,\Delta x^2.

We want to find \Delta x such that f(x_n + \Delta x) is maximum. We seek to solve the equation that sets the derivative of this last expression with respect to \Delta x equal to zero:

\frac{d}{d\,\Delta x}\left( f(x_n) + f'(x_n)\,\Delta x + \frac{1}{2} f''(x_n)\,\Delta x^2 \right) = f'(x_n) + f''(x_n)\,\Delta x = 0.

For the value of \Delta x = -f'(x_n) / f''(x_n), which is the solution of this equation, it can be hoped that x_{n+1} = x_n + \Delta x = x_n - f'(x_n) / f''(x_n) will be closer to a stationary point x^*. Provided that f is a twice-differentiable function and other technical conditions are satisfied, the sequence x_1, x_2, \ldots will converge to a point x^* satisfying f'(x^*) = 0.
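A minimal sketch of this one-dimensional iteration in Python; the quartic test function and starting point are my own illustration.

```python
# One-dimensional Newton's method for optimization: iterate
# x_{n+1} = x_n - f'(x_n) / f''(x_n).
def newton_optimize(f_prime, f_double_prime, x0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Toy example: f(x) = x^4 - 3x^2 + 2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
x_star = newton_optimize(lambda x: 4 * x**3 - 6 * x,
                         lambda x: 12 * x**2 - 6,
                         x0=2.0)
print(x_star)   # converges to the stationary point sqrt(3/2) ~ 1.2247
```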

[Figure: Newton_optimization_vs_grad_descent.svg]

A comparison of gradient descent (green) and Newton’s method (red) for minimizing a function (with small step sizes). Newton’s method uses curvature information to take a more direct route.

Geometric interpretation

The geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.

Higher dimensions

The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, \nabla f(x), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H f(x). One obtains the iterative scheme

x_{n+1} = x_n - [H f(x_n)]^{-1} \nabla f(x_n), \qquad n \geq 0.

Often Newton's method is modified to include a small step size \gamma \in (0,1) instead of \gamma = 1:

x_{n+1} = x_n - \gamma\,[H f(x_n)]^{-1} \nabla f(x_n).

This is often done to ensure that the Wolfe conditions are satisfied at each step x_n \to x_{n+1} of the iteration. For step sizes other than 1, the method is often referred to as the relaxed Newton's method.

Where applicable, Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood N such that, if we start with x_0 \in N, Newton's method with step size \gamma = 1 converges quadratically (if the Hessian is invertible and a Lipschitz continuous function of x in that neighborhood).

───

Hessian matrix

Use in optimization

Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function. That is,

y = f(x + \Delta x) \approx f(x) + \nabla f(x)^{T} \Delta x + \frac{1}{2} \Delta x^{T} H(x)\,\Delta x

where \nabla f is the gradient (\partial f/\partial x_1, \ldots, \partial f/\partial x_n). Computing and storing the full Hessian matrix takes \Theta(n^2) memory, which is infeasible for high-dimensional functions such as the loss functions of neural nets, conditional random fields, and other statistical models with large numbers of parameters. For such situations, truncated-Newton and quasi-Newton algorithms have been developed. The latter family of algorithms use approximations to the Hessian; one of the most popular quasi-Newton algorithms is BFGS.[7]

Such approximations may use the fact that an optimization algorithm uses the Hessian only as a linear operator H(v), and proceed by first noticing that the Hessian also appears in the local expansion of the gradient:

\nabla f(x + \Delta x) = \nabla f(x) + H(\Delta x) + O(\|\Delta x\|^{2})

Letting Δx = rv for some scalar r, this gives

H(\Delta x) = H(r v) = r\,H(v) = \nabla f(x + r v) - \nabla f(x) + O(r^{2}),

i.e.,

H(v) = \frac{1}{r}\left[\nabla f(x + r v) - \nabla f(x)\right] + O(r)

so if the gradient is already computed, the approximate Hessian can be computed by a linear (in the size of the gradient) number of scalar operations. (While simple to program, this approximation scheme is not numerically stable since r has to be made small to prevent error due to the O(r) term, but decreasing it loses precision in the first term.[8])
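The finite-difference identity above is easy to try out numerically; here is a minimal NumPy sketch, where the quadratic test function is my own choice so that the exact Hessian-vector product is known.

```python
import numpy as np

def hessian_vector_product(grad_f, x, v, r=1e-5):
    """Approximate H(v) = (1/r) * [grad f(x + r v) - grad f(x)] + O(r),
    following the finite-difference identity above."""
    return (grad_f(x + r * v) - grad_f(x)) / r

# Toy check: f(x) = 0.5 * x^T A x has gradient A x and Hessian A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
grad_f = lambda x: A @ x

x = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
print(hessian_vector_product(grad_f, x, v))   # ~ A @ v = [4, 7]
print(A @ v)                                  # exact value for comparison
```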

───


W!o+'s 《小伶鼬工坊演義》: Neural Networks [Words of a Hundred Schools], Part 1

New Book of Tang, Volume 176
Han Yu

Han Yu, courtesy name Tuizhi, was a man of Nanyang in Dengzhou. His ancestor of the seventh generation, Han Mao, rendered merit to the Later Wei and was enfeoffed as Prince of Anding. His father Han Zhongqing served as magistrate of Wuchang and governed admirably; after he left, the people of the county carved a stone in praise of his virtue. He ended his career as a Gentleman of the Palace Library. Han Yu was orphaned at the age of three and followed his eldest brother Han Hui when the latter was demoted to a post beyond the southern ranges. When Hui died, his widow, née Zheng, raised the boy. From the time he could read, Han Yu committed hundreds and thousands of words to memory each day, and by the time he was grown he had fully mastered the Six Classics and the learning of the hundred schools. He passed the jinshi examination. When Dong Jin became military governor of Xuanwu, he memorialized to have Han Yu appointed investigating judge. After Dong Jin died, Han Yu accompanied the funeral cortege out of the city; within four days the Bian army mutinied, and so he departed. He then attached himself to Zhang Jianfeng, military governor of Wuning, who appointed him judge of his headquarters. Firm and upright in conduct, he spoke bluntly and without fear. He was transferred to be Erudite of the School of the Four Gates, then promoted to Investigating Censor. When he submitted a memorial sharply criticizing the palace-market system, Emperor Dezong was enraged and demoted him to magistrate of Yangshan. His kindness remained with the people there, and many of them named their newborn sons with his surname or courtesy name. He was reassigned as administrator of the law section at Jiangling. Early in the Yuanhe era he served as acting Erudite of the Imperial Academy, on detached duty in the Eastern Capital, and after three years the post was made substantive. He was then made Vice Director of the Bureau of Punishments, soon appointed magistrate of Henan, and afterwards transferred to be Vice Director of the Bureau of Operations.

……

He often said that after Sima Xiangru, the Grand Historian, Liu Xiang, and Yang Xiong of the Han, great writers had appeared only once in many generations; therefore Han Yu probed deeply to the very sources, established himself with towering distinction, and formed the words of a school of his own. His several dozen pieces such as "Yuan Dao" (On the Origin of the Way), "Yuan Xing" (On the Origin of Human Nature), and "Shi Shuo" (On Teachers) are all profound, expansive and deep; they stand as inner and outer complements to Mencius and Yang Xiong and lend support to the Six Classics. As for his other writings, in framing their openings and setting down their words, the essential point is that they never tread in the footsteps of earlier writers. Yet only Han Yu could do this with an overflowing ease, as though he had strength to spare; when his followers Li Ao, Li Han, and Huangfu Shi tried to imitate him, they at once fell far short. Among those who kept his company, men such as Meng Jiao and Zhang Ji also made names for themselves in their own day.

───

 

To form the words of one's own school, one must first read the books of a hundred schools. Only by deep and careful thought can one discard the dross and keep the essence; only by arguing a point from both sides can one clearly discern what matters. Perhaps this is exactly why Mr. Michael Nielsen declares:

Other techniques

Each technique developed in this chapter is valuable to know in its own right, but that’s not the only reason I’ve explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we’ve been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.

───

 

[Supplementary material]

Awesome Deep Learning

Table of Contents

Free Online Books

  1. Deep Learning by Yoshua Bengio, Ian Goodfellow and Aaron Courville (05/07/2015)
  2. Neural Networks and Deep Learning by Michael Nielsen (Dec 2014)
  3. Deep Learning by Microsoft Research (2013)
  4. Deep Learning Tutorial by LISA lab, University of Montreal (Jan 6 2015)
  5. neuraltalk by Andrej Karpathy : numpy-based RNN/LSTM implementation
  6. An introduction to genetic algorithms
  7. Artificial Intelligence: A Modern Approach
  8. Deep Learning in Neural Networks: An Overview

Courses

  1. Machine Learning – Stanford by Andrew Ng in Coursera (2010-2014)
  2. Machine Learning – Caltech by Yaser Abu-Mostafa (2012-2014)
  3. Machine Learning – Carnegie Mellon by Tom Mitchell (Spring 2011)
  4. Neural Networks for Machine Learning by Geoffrey Hinton in Coursera (2012)
  5. Neural networks class by Hugo Larochelle from Université de Sherbrooke (2013)
  6. Deep Learning Course by CILVR lab @ NYU (2014)
  7. A.I – Berkeley by Dan Klein and Pieter Abbeel (2013)
  8. A.I – MIT by Patrick Henry Winston (2010)
  9. Vision and learning – computers and brains by Shimon Ullman, Tomaso Poggio, Ethan Meyers @ MIT (2013)
  10. Convolutional Neural Networks for Visual Recognition – Stanford by Fei-Fei Li, Andrej Karpathy (2015)
  11. Convolutional Neural Networks for Visual Recognition – Stanford by Fei-Fei Li, Andrej Karpathy (2016)
  12. Deep Learning for Natural Language Processing – Stanford
  13. Neural Networks – usherbrooke
  14. Machine Learning – Oxford (2014-2015)
  15. Deep Learning – Nvidia (2015)
  16. Graduate Summer School: Deep Learning, Feature Learning by Geoffrey Hinton, Yoshua Bengio, Yann LeCun, Andrew Ng, Nando de Freitas and several others @ IPAM, UCLA (2012)
  17. Deep Learning – Udacity/Google by Vincent Vanhoucke and Arpan Chakraborty (2016)
  18. Deep Learning – UWaterloo by Prof. Ali Ghodsi at University of Waterloo (2015)

………