Rock It 《ML》 scikit-learn (Part 1)

To verify that the newly derived working environment was in order, I ran the Jupyter notebook for Chapter 1 of Aurélien Géron's book:

handson-ml/01_the_machine_learning_landscape.ipynb

 

After all, I am a scikit-learn novice myself, and looking up 'library call' documentation back and forth hardly seemed like a good way to learn! What I really wanted was the API design blueprint.

By chance I came across

API design for machine learning software: experiences from the scikit-learn project

Lars Buitinck (ILPS), Gilles Louppe, Mathieu Blondel, Fabian Pedregosa (INRIA Saclay – Ile de France), Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort (INRIA Saclay – Ile de France, LTCI), Jaques Grobler (INRIA Saclay – Ile de France), Robert Layton, Jake Vanderplas, Arnaud Joly, Brian Holt, Gaël Varoquaux (INRIA Saclay – Ile de France)

Scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.

Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS)
Journal reference: European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (2013)
Cite as: arXiv:1309.0238 [cs.LG]
  (or arXiv:1309.0238v1 [cs.LG] for this version)

Submission history

From: Gael Varoquaux
[v1] Sun, 1 Sep 2013 16:22:48 UTC (28 KB)

 

What a delightfully serendipitous find!!

1309.0238.pdf

……

………

 

In a burst of zeal, it felt as though I could take in the documentation's structure at a glance.

 

Classification

Identifying which category an object belongs to.

Regression

Predicting a continuous-valued attribute associated with an object.

Clustering

Automatic grouping of similar objects into sets.

Dimensionality reduction

Reducing the number of random variables to consider.

Model selection

Comparing, validating and choosing parameters and models.

Preprocessing

Feature extraction and normalization.
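What the paper's abstract calls "the simple and elegant interface shared by all learning and processing units" cuts across all six task categories above. As a minimal sketch (with toy data of my own, not taken from the docs): supervised estimators expose fit/predict, while preprocessing and dimensionality-reduction units expose fit/transform.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Two well-separated groups of points (toy data for illustration only).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])

# Classification: fit on (X, y), then predict.
clf = LogisticRegression().fit(X, y)
labels = clf.predict(X)

# Clustering: the same fit idiom, but unsupervised (no y).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Preprocessing and dimensionality reduction: fit, then transform.
X_scaled = StandardScaler().fit(X).transform(X)
X_1d = PCA(n_components=1).fit_transform(X)
```

The uniform interface is what makes composition (e.g. chaining a scaler, a reducer, and a classifier in a Pipeline) possible.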

1.1. Generalized Linear Models

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the input variables. In mathematical notation, if \hat{y} is the predicted value, then:

\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

Across the module, we designate the vector w = (w_1,..., w_p) as coef_ and w_0 as intercept_.

To perform classification with generalized linear models, see Logistic regression.

1.1.1. Ordinary Least Squares

LinearRegression fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed responses in the dataset, and the responses predicted by the linear approximation. Mathematically it solves a problem of the form:

\min_{w} {|| X w - y||_2}^2

[Figure: Ordinary Least Squares example plot (sphx_glr_plot_ols_0011.png)]

LinearRegression will take in its fit method arrays X, y and will store the coefficients w of the linear model in its coef_ member:

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> reg.coef_
array([0.5, 0.5])

However, coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
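The multicollinearity problem described above is easy to reproduce. In this small illustration (synthetic data of my own construction), two columns of X are nearly identical, so a tiny perturbation of the observed response swings the individual coefficients wildly even though the well-determined combined slope barely moves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=50)
# Second column is almost a copy of the first: near-singular design matrix.
X = np.column_stack([x, x + 1e-6 * rng.normal(size=50)])
y = X @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=50)

coef_a = LinearRegression().fit(X, y).coef_
# Refit after adding a tiny extra perturbation to the response.
coef_b = LinearRegression().fit(X, y + 0.001 * rng.normal(size=50)).coef_

# Individual coefficients jump between fits, yet their sum (the
# well-determined direction w_1 + w_2) stays close to 2.
print(coef_a, coef_b, coef_a.sum())
```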

Examples:

……

API Reference

This is the class and function reference of scikit-learn. Please refer to the full user guide for further details, as the class and function raw specifications may not be enough to give full guidelines on their uses. For reference on concepts repeated across the API, see Glossary of Common Terms and API Elements.

………

sklearn.linear_model.LinearRegression

class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
Ordinary least squares Linear Regression.

Parameters:
fit_intercept : boolean, optional, default True

Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.

copy_X : boolean, optional, default True

If True, X will be copied; else, it may be overwritten.

n_jobs : int or None, optional (default=None)

The number of jobs to use for the computation. This will only provide speedup for n_targets > 1 and sufficiently large problems. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Attributes:
coef_ : array, shape (n_features, ) or (n_targets, n_features)

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

intercept_ : array

Independent term in the linear model.
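A quick check of the coef_ shape convention described above (toy data of my own): a 1-D y yields coef_ of shape (n_features,), while a 2-D y of shape (n_samples, n_targets) yields (n_targets, n_features).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y_single = X @ np.array([1.0, 2.0])                    # one target
y_multi = np.column_stack([y_single, 3.0 * y_single])  # two targets

c_single = LinearRegression().fit(X, y_single).coef_
c_multi = LinearRegression().fit(X, y_multi).coef_
print(c_single.shape, c_multi.shape)  # (2,) (2, 2)
```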

Notes

From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.
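Since the Notes say this is plain Ordinary Least Squares wrapped as a predictor, one can sanity-check (my own check, not from the docs) that LinearRegression agrees with solving the least-squares problem directly via scipy.linalg.lstsq, with a column of ones standing in for the intercept:

```python
import numpy as np
from scipy.linalg import lstsq
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

reg = LinearRegression().fit(X, y)

# Solve min ||Aw - y||_2^2 by hand, appending a ones column for w_0.
A = np.column_stack([X, np.ones(len(X))])
w, *_ = lstsq(A, y)

print(np.allclose(w[:3], reg.coef_), np.allclose(w[3], reg.intercept_))
```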

 

I am happy, therefore, to share this with fellow enthusiasts.