W!o+'s 《小伶鼬工坊演義》: Neural Networks 【A Hundred Schools of Thought】, Part Five

Following along as Mr. Michael Nielsen tells the story

On stories in neural networks

Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong […] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

Question and answer with neural networks researcher Yann LeCun

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with “I’m very sympathetic to your point of view, but […]”. Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I’d rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn’t always have unarguable new progress to report.

You may have noticed a verbal tic similar to “I’m very sympathetic […]” in the current book. To explain what we’re seeing I’ve often fallen back on saying “Heuristically, […]”, or “Roughly speaking, […]”, following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I’ve presented has often been pretty thin. If you look through the research literature you’ll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?

In many parts of science – especially those parts that deal with simple phenomena – it’s possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it’s exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.

One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works*: “This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.” This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we’ve discussed. Each heuristic is not just a (potential) explanation, it’s also a challenge to investigate and understand in more detail.

*From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
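To keep the technique itself concrete, here is a minimal NumPy sketch of dropout (the inverted-dropout variant that is common today; the function name, shapes, and rescaling convention are illustrative, not taken from the paper or from this book). Each forward pass samples a fresh random subset of neurons, which is what forces the robustness described in the quote:

```python
import numpy as np

def dropout_forward(activations, p_keep=0.5, rng=None):
    """Inverted dropout: keep each neuron with probability p_keep, zero it
    otherwise, and rescale survivors so the expected activation is
    unchanged (so no rescaling is needed at test time)."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

# Every call sees a different random sub-network, so no neuron can rely
# on the presence of particular other neurons.
hidden = np.ones((2, 8))        # a toy batch of hidden-layer activations
print(dropout_forward(hidden))
```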

Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It’s going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It’s like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly – as the explorers understood geography, and as we understand neural nets today – it’s more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.

───

 

This chapter, too, is drawing to a close. In an emerging technical field, a hundred schools of thought usually contend, yet the deep theoretical groundwork always seems lacking! A mature science, by contrast, finds breakthroughs hard to come by, and waits for the flint-spark flash of inspiration! Is this not the Zen debate of 'the withered and the flourishing'? Yet whichever side one stands on, can one not still draw on the 'experience' of history? Consider 'computational physics':

Computational physics

Computational physics is the study and implementation of numerical analysis to solve problems in physics for which a quantitative theory already exists.[1] Historically, computational physics was the first application of modern computers in science, and is now a subset of computational science.

It is sometimes regarded as a subdiscipline (or offshoot) of theoretical physics, but others consider it an intermediate branch between theoretical and experimental physics, a third way that supplements theory and experiment.[2]

[Figure: a representation of the multidisciplinary nature of computational physics, both as an overlap of physics, applied mathematics, and computer science, and as a bridge among them.[3]]

Challenges in computational physics

Physics problems are in general very difficult to solve exactly. This is due to several (mathematical) reasons: lack of algebraic and/or analytic solvability, complexity, and chaos.

For example, even apparently simple problems, such as calculating the wavefunction of an electron orbiting an atom in a strong electric field (the Stark effect), may require great effort to formulate a practical algorithm (if one can be found); other cruder or brute-force techniques, such as graphical methods or root finding, may be required. On the more advanced side, mathematical perturbation theory is also sometimes used (a worked treatment exists for this particular example).

In addition, the computational cost and computational complexity for many-body problems (and their classical counterparts) tend to grow quickly. A macroscopic system typically has on the order of $10^{23}$ constituent particles, so exact treatment is far out of reach. Solving quantum mechanical problems is generally of exponential order in the size of the system, and for classical N-body problems the naive approach is of order $N^2$.
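To see the classical $N^2$ scaling concretely, here is a minimal sketch (NumPy assumed; the particle data and function name are illustrative) of the naive pairwise-force computation, which must visit all $N(N-1)/2$ pairs:

```python
import numpy as np

def pairwise_forces(pos, mass, G=1.0, eps=1e-3):
    """Naive O(N^2) gravitational forces: every pair of particles is visited."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):                # N(N-1)/2 pairs in total
            r = pos[j] - pos[i]
            d3 = (np.dot(r, r) + eps**2) ** 1.5  # softened |r|^3
            f = G * mass[i] * mass[j] * r / d3
            forces[i] += f
            forces[j] -= f                       # Newton's third law
    return forces

# Doubling N roughly quadruples the work; 10^23 particles is hopeless.
rng = np.random.default_rng(0)
forces = pairwise_forces(rng.standard_normal((100, 3)), np.ones(100))
```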

Finally, many physical systems are inherently nonlinear at best, and at worst chaotic: this means it can be difficult to ensure that numerical errors do not grow to the point of rendering the ‘solution’ useless.[5]
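A toy illustration of the chaotic case (a sketch; the logistic map at $r = 4$ is a standard chaotic example): two trajectories started $10^{-12}$ apart lose all resemblance within a few dozen iterations, so rounding error alone can destroy a computed 'solution':

```python
# Chaotic error growth in the logistic map x -> r*x*(1-x) at r = 4.
r = 4.0
x, y = 0.3, 0.3 + 1e-12          # two initial conditions 1e-12 apart
for step in range(1, 61):
    x = r * x * (1.0 - x)
    y = r * y * (1.0 - y)
    if step % 10 == 0:
        print(f"step {step:2d}: |x - y| = {abs(x - y):.3e}")
# The separation grows roughly like exp(t * ln 2), reaching order one
# after about 40 steps: a 1e-12 perturbation has destroyed all accuracy.
```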

───

 

we have already had a first taste of its inner essence!!

And we must take care with the 'well-posedness' of our mathematical models??

Well-posed problem

The mathematical term well-posed problem stems from a definition given by Jacques Hadamard. He believed that mathematical models of physical phenomena should have the properties that

  1. A solution exists
  2. The solution is unique
  3. The solution’s behavior changes continuously with the initial conditions.

Examples of archetypal well-posed problems include the Dirichlet problem for Laplace’s equation, and the heat equation with specified initial conditions. These might be regarded as ‘natural’ problems in that there are physical processes modelled by these problems.

Problems that are not well-posed in the sense of Hadamard are termed ill-posed. Inverse problems are often ill-posed. For example, the inverse heat equation, deducing a previous distribution of temperature from final data, is not well-posed in that the solution is highly sensitive to changes in the final data.
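A small numerical sketch of this sensitivity (assumptions: a periodic domain, a spectral solver, toy sizes, NumPy): the forward heat flow damps Fourier mode $k$ by $e^{-k^2 t}$, so the backward problem amplifies the final data, measurement noise included, by $e^{+k^2 t}$:

```python
import numpy as np

n, t = 128, 1e-3
k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)         # wavenumbers
x = np.arange(n) / n
u0 = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)          # true initial temperature

decay = np.exp(-k**2 * t)                            # forward heat kernel
u_final = np.fft.ifft(decay * np.fft.fft(u0)).real   # forward: well-posed

noise = 1e-6 * np.random.default_rng(1).standard_normal(n)
u_rec = np.fft.ifft(np.fft.fft(u_final + noise) / decay).real  # backward

print("largest backward amplification:", 1 / decay.min())   # about 1e70
print("reconstruction error:", np.max(np.abs(u_rec - u0)))  # enormous
```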

Continuum models must often be discretized in order to obtain a numerical solution. While solutions may be continuous with respect to the initial conditions, they may suffer from numerical instability when solved with finite precision, or with errors in the data. Even if a problem is well-posed, it may still be ill-conditioned, meaning that a small error in the initial data can result in much larger errors in the answers. An ill-conditioned problem is indicated by a large condition number.
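For example (a minimal sketch; the Hilbert matrix is a textbook ill-conditioned matrix, NumPy assumed), a condition number near $10^{13}$ turns a $10^{-10}$ perturbation of the data into a solution error many orders of magnitude larger, even though the linear system itself is well posed:

```python
import numpy as np

n = 10
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)  # Hilbert matrix
x_true = np.ones(n)
b = A @ x_true

print("condition number:", np.linalg.cond(A))   # about 1.6e13 for n = 10

b_noisy = b + 1e-10 * np.random.default_rng(0).standard_normal(n)
x = np.linalg.solve(A, b_noisy)
print("relative solution error:",
      np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```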

If the problem is well-posed, then it stands a good chance of solution on a computer using a stable algorithm. If it is not well-posed, it needs to be re-formulated for numerical treatment. Typically this involves including additional assumptions, such as smoothness of solution. This process is known as regularization. Tikhonov regularization is one of the most commonly used for regularization of linear ill-posed problems.

───

 

From this, one can see why 'regularization' is unavoidable!!!

Tikhonov regularization

Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization of ill-posed problems. In statistics, the method is known as ridge regression, and with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm for non-linear least-squares problems.

Suppose that for a known matrix $A$ and vector $\mathbf{b}$, we wish to find a vector $\mathbf{x}$ such that

$$A\mathbf{x} = \mathbf{b}.$$

The standard approach is ordinary least squares linear regression. However, if no $\mathbf{x}$ satisfies the equation, or more than one $\mathbf{x}$ does (that is, the solution is not unique), the problem is said not to be well posed. In such cases, ordinary least squares estimation leads to an overdetermined (over-fitted), or more often an underdetermined (under-fitted), system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where $A$ maps $\mathbf{x}$ to $\mathbf{b}$. Therefore, in solving the inverse problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of $\mathbf{x}$ that lies in the null space of $A$, rather than allowing a model to be used as a prior for $\mathbf{x}$. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as

$$\|A\mathbf{x} - \mathbf{b}\|^{2}$$

where $\|\cdot\|$ is the Euclidean norm. In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:

$$\|A\mathbf{x} - \mathbf{b}\|^{2} + \|\Gamma\mathbf{x}\|^{2}$$

for some suitably chosen Tikhonov matrix $\Gamma$. In many cases, this matrix is chosen as a multiple of the identity matrix ($\Gamma = \alpha I$), giving preference to solutions with smaller norms; this is known as L2 regularization.[1] In other cases, low-pass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by $\hat{\mathbf{x}}$, is given by:

$$\hat{\mathbf{x}} = (A^{T}A + \Gamma^{T}\Gamma)^{-1}A^{T}\mathbf{b}.$$

The effect of regularization may be varied via the scale of the matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized least-squares solution, provided that $(A^{T}A)^{-1}$ exists.
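Here is a short sketch of that closed-form solution (NumPy assumed, synthetic data, and the common choice $\Gamma = \alpha I$, i.e. ridge regression); the problem is made ill-conditioned by giving $A$ two nearly collinear columns:

```python
import numpy as np

def tikhonov(A, b, alpha):
    """Minimize ||Ax - b||^2 + ||Gamma x||^2 with Gamma = alpha * I,
    via x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b."""
    G = alpha * np.eye(A.shape[1])
    return np.linalg.solve(A.T @ A + G.T @ G, A.T @ b)

# Two nearly collinear columns make A^T A almost singular.
rng = np.random.default_rng(0)
t = rng.standard_normal(50)
A = np.column_stack([t, t + 1e-6 * rng.standard_normal(50)])
b = A @ np.array([1.0, 1.0]) + 0.01 * rng.standard_normal(50)

print("cond(A^T A):", np.linalg.cond(A.T @ A))  # huge
print("alpha = 0  :", tikhonov(A, b, 0.0))      # unregularized: wildly unstable
print("alpha = 0.1:", tikhonov(A, b, 0.1))      # regularized: roughly [1, 1]
```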

L2 regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines,[2] and matrix factorization.[3]

……

Bayesian interpretation

Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix $\Gamma$ seems rather arbitrary, the process can be justified from a Bayesian point of view. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a unique solution. Statistically, the prior probability distribution of $\mathbf{x}$ is sometimes taken to be a multivariate normal distribution. For simplicity here, the following assumptions are made: the means are zero; the components are independent; the components have the same standard deviation $\sigma_x$. The data are also subject to errors, and the errors in $\mathbf{b}$ are likewise assumed to be independent with zero mean and standard deviation $\sigma_b$. Under these assumptions the Tikhonov-regularized solution is the most probable solution given the data and the a priori distribution of $\mathbf{x}$, according to Bayes’ theorem.[4]
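In symbols (a sketch under exactly the assumptions just listed, in the notation of this section): with prior $\mathbf{x} \sim \mathcal{N}(0, \sigma_x^{2} I)$ and noise $\mathbf{b} - A\mathbf{x} \sim \mathcal{N}(0, \sigma_b^{2} I)$, maximizing the posterior $p(\mathbf{x} \mid \mathbf{b}) \propto p(\mathbf{b} \mid \mathbf{x})\, p(\mathbf{x})$ amounts to

$$\hat{\mathbf{x}}_{\mathrm{MAP}} = \arg\min_{\mathbf{x}} \left[ \frac{\|A\mathbf{x} - \mathbf{b}\|^{2}}{2\sigma_b^{2}} + \frac{\|\mathbf{x}\|^{2}}{2\sigma_x^{2}} \right] = \arg\min_{\mathbf{x}} \left[ \|A\mathbf{x} - \mathbf{b}\|^{2} + \frac{\sigma_b^{2}}{\sigma_x^{2}} \|\mathbf{x}\|^{2} \right],$$

which is the Tikhonov objective with $\Gamma = (\sigma_b / \sigma_x) I$.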

If the assumption of normality is replaced by assumptions of homoscedasticity and uncorrelatedness of errors, and if one still assumes zero mean, then the Gauss–Markov theorem entails that the solution is the minimum-variance linear unbiased estimator.

───