Rock It 《ML》scikit-learn 【三】

當我們仔細閱讀了

An introduction to machine learning with scikit-learn

Section contents

In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a simple learning example.

Machine learning: the problem setting

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Learning problems fall into a few categories:

  • supervised learning, in which the data comes with additional attributes that we want to predict (Click here to go to the scikit-learn supervised learning page).This problem can be either:

    • classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
    • regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
  • unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (Click here to go to the Scikit-Learn unsupervised learning page).

也查找瀏覽過

 

User Guide

……

甚至盯著瞧了半天

Choosing the right estimator

Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.

Different estimators are better suited for different types of data and different problems.

The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.

Click on any estimator in the chart below to see its documentation.Move mouse over image

───

若是不能掌握

/scipy-2018-sklearn

Scipy 2018 scikit-learn tutorial by Guillaume Lemaitre and Andreas Mueller

Instructors


This repository will contain the teaching material and other info associated with our scikit-learn tutorial at SciPy 2018 held July 9-15 in Austin, Texas.

Parts 1 to 12 make up the morning session, while parts 13 to 23 will be presented in the afternoon (approximately)

Schedule:

The 2-part tutorial will be held on Tuesday, July 10, 2018.

Obtaining the Tutorial Material

If you have a GitHub account, it is probably most convenient if you clone or fork the GitHub repository. You can clone the repository by running:

git clone https://github.com/amueller/scipy-2018-sklearn.git

 

02 Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib

………

怕還難入門哩☻

有興趣者務先認識 SciPy 生態系統基本核心軟件也☆

The SciPy ecosystem

Scientific computing in Python builds upon a small core of packages:

  • Python, a general purpose programming language. It is interpreted and dynamically typed and is very suited for interactive work and quick prototyping, while being powerful enough to write large applications in.
  • NumPy, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.
  • The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.
  • Matplotlib, a mature and popular plotting package, that provides publication-quality 2D plotting as well as rudimentary 3D plotting

On this base, the SciPy ecosystem includes general and specialised tools for data management and computation, productive experimentation and high-performance computing. Below we overview some key packages, though there are many more relevant packages.

Data and computation:

  • pandas, providing high-performance, easy to use data structures.
  • SymPy, for symbolic mathematics and computer algebra.
  • scikit-image is a collection of algorithms for image processing.
  • scikit-learn is a collection of algorithms and tools for machine learning.
  • h5py and PyTables can both access data stored in the HDF5 format.

Productivity and high-performance computing:

  • IPython, a rich interactive interface, letting you quickly process data and test ideas.
  • The Jupyter notebook provides IPython functionality and more in your web browser, allowing you to document your computation in an easily reproducible form.
  • Cython extends Python syntax so that you can conveniently build C extensions, either to speed up critical code, or to integrate with C/C++ libraries.
  • Dask, Joblib or IPyParallel for distributed processing with a focus on numeric data.

Quality assurance:

  • nose, a framework for testing Python code, being phased out in preference for pytest.
  • numpydoc, a standard and library for documenting Scientific Python libraries.