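Since "machine learning" is a field "centered" on "data", how could one neglect any aspect of that data? Readers are therefore advised to first read the following three short notes from /scipy-2018-sklearn: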
03 Data formats, preparation, and representation
04 Supervised learning: Training and test data
10 Preparing a real-world dataset (titanic)
※ Note: because the notebooks may run into "Problems importing pandas.plotting", the author decided to upgrade pandas.
rock64@rock64:~$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'0.19.2'
>>>
sudo pip3 install --upgrade pandas
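The check above can also be scripted. The following is only a minimal sketch, not something from the original notes: the helper names `version_tuple` and `has_pandas_plotting` are hypothetical, and the assumption that `pandas.plotting` first appeared in pandas 0.20 (so that 0.19.2 lacks it) is the author's recollection rather than something stated in the notebooks.

```python
import importlib


def version_tuple(version):
    """Turn a version string such as '0.19.2' into a tuple of ints (0, 19, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def has_pandas_plotting():
    """Return True if the pandas.plotting submodule can be imported."""
    try:
        importlib.import_module("pandas.plotting")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    try:
        import pandas
    except ImportError:
        print("pandas is not installed")
    else:
        # Assumption: pandas.plotting is expected from version 0.20 onward.
        print("pandas", pandas.__version__)
        print("older than 0.20:", version_tuple(pandas.__version__) < (0, 20, 0))
        print("pandas.plotting importable:", has_pandas_plotting())
```

Running this before and after the upgrade shows whether the import problem has gone away.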
It also helps to get better acquainted with scikit-learn's utility functions:
5. Dataset loading utilities
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.
This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.
5.1. General dataset API
There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.
The dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.
The dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets section.
Both loaders and fetchers functions return a dictionary-like object holding at least two items: an array of shape n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the target values, with key target.
It’s also possible for almost all of these functions to constrain the output to be a tuple containing only the data and the target, by setting the return_X_y parameter to True.
The datasets also contain a full description in their DESCR attribute and some contain feature_names and target_names. See the dataset descriptions below for details.
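As a small illustration of the loader interface just described, the sketch below uses load_iris, one of the toy dataset loaders in sklearn.datasets. The choice of the iris dataset is the author's own; any other loader should behave similarly.

```python
from sklearn.datasets import load_iris

# A loader returns a dictionary-like object with at least "data" and "target".
iris = load_iris()
print(iris.data.shape)     # (n_samples, n_features)
print(iris.target.shape)   # (n_samples,)
print(iris.feature_names)  # present for this dataset, but not for all
print(iris.target_names)
print(iris.DESCR[:200])    # beginning of the full description

# With return_X_y=True the loader returns only the (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

Both access styles give the same arrays; the tuple form is convenient when the description and names are not needed.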
The dataset generation functions. They can be used to generate controlled synthetic datasets, described in the Generated datasets section.
These functions return a tuple (X, y) consisting of a n_samples * n_features numpy array X and an array of length n_samples containing the targets y.
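A generator can be tried in the same way. The sketch below uses make_classification with parameter values picked arbitrarily for illustration; n_informative and n_redundant are one way to control how informative and how correlated the features are.

```python
from sklearn.datasets import make_classification

# Generate a controlled synthetic classification problem.
# The sizes below are arbitrary illustrative choices.
X, y = make_classification(
    n_samples=200,     # number of rows
    n_features=10,     # total number of columns
    n_informative=3,   # features that actually carry class information
    n_redundant=2,     # linear combinations of the informative features
    random_state=0,    # fixed seed so the data is reproducible
)
print(X.shape)  # (200, 10)
print(y.shape)  # (200,)
```

Varying n_samples and n_features while keeping the other settings fixed is one way to study how an algorithm scales with dataset size.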
In addition, there are also miscellaneous tools to load datasets of other formats or from other locations, described in the Loading other datasets section.
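Then move on to Chapter 2 of Aurélien Géron's book, the "end-to-end" project that runs from the novice's end to the veteran's end! With the basics now second nature, the reader may be better placed to follow its lengthy discussion: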
02_end_to_end_machine_learning_project.ipynb