Education and Learning: Up《grade》【6.5】

Wikipedia maintains a list of datasets for machine learning research:

List of datasets for machine learning research

These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals and other publications. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4][5] This list aggregates high-quality datasets that have been shown to be of value to the machine learning research community from multiple different data repositories to provide greater coverage of the topic than is otherwise available.

………

 

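The list is certainly convenient for browsing and lookup, but it does not always make clear how to actually obtain the data.

So here are a couple of public dataset sites, for the benefit of anyone learning neural networks: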

Open Data for Deep Learning

Here you’ll find an organized list of interesting, high-quality datasets for machine learning research. We welcome your contributions for curating this list! You can find other lists of such datasets on Wikipedia, for example.

Natural-Image Datasets

  • MNIST: handwritten digits: The most commonly used sanity check. Dataset of 28×28, centered, B&W handwritten digits. It is an easy task; just because something works on MNIST doesn’t mean it works elsewhere.
  • CIFAR10 / CIFAR100: 32×32 color images with 10 / 100 categories. Not commonly used anymore, though once again, can be an interesting sanity check.
  • Caltech 101: Pictures of objects belonging to 101 categories.
  • Caltech 256: Pictures of objects belonging to 256 categories.
  • STL-10 dataset: An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.
  • The Street View House Numbers (SVHN): House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
  • NORB: Binocular images of toy figurines under various illumination and pose.
  • Pascal VOC: Generic image segmentation / classification. Not terribly useful for building real-world image annotation, but great for baselines.
  • Labelme: A large dataset of annotated images.
  • ImageNet: The de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000-category WordNet hierarchy from ImageNet.
  • LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
  • MS COCO: Generic image understanding / captioning, with an associated competition.
  • COIL 20: Different objects imaged at every angle in a 360-degree rotation.
  • COIL 100: Different objects imaged at every angle in a 360-degree rotation.
  • Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
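To give a feel for how small the "sanity check" datasets in this list are, here is a minimal sketch that works out the number of input values per image. It assumes the commonly cited shapes: MNIST images are 28×28 with one grayscale channel, and CIFAR-10 images are 32×32 with three color channels. The names `IMAGE_SHAPES` and `flattened_size` are made up for this illustration.

```python
# (height, width, channels) for two of the small datasets listed above.
# These shapes are assumptions stated in the text, not read from the data files.
IMAGE_SHAPES = {
    "MNIST": (28, 28, 1),      # grayscale digits
    "CIFAR-10": (32, 32, 3),   # small color images
}


def flattened_size(shape):
    """Return how many numbers one image holds once flattened into a vector."""
    height, width, channels = shape
    return height * width * channels


for name, shape in IMAGE_SHAPES.items():
    print(name, shape, flattened_size(shape))
# MNIST (28, 28, 1) 784
# CIFAR-10 (32, 32, 3) 3072
```

A fully connected input layer for MNIST therefore sees 784 values per example, which is part of why it trains quickly and serves well as a first test.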

……

About OpenSLR

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. We intend to be a convenient place for anyone to put resources that they have created, so that they can be downloaded publicly.

Part of our goal is to mirror software available elsewhere, in order to provide a failover location. We are starting by mirroring some software which is used in the Kaldi scripts. We plan to make it easy for others in turn to mirror this site; please ask us for details.

We aim to provide a central, hassle-free place for others to put their speech resources. For more information, see here.

For a list of resources, please click on resources above.

If you want to download things from this site, please download them one at a time, and please don’t use any fancy software; just download things from your browser or use ‘wget’. We have noticed a number of people who seem to be trying to download many things simultaneously, and we have had to block their IPs in order to avoid site-wide slowdown. We also had to add a firewall rule to drop connections from hosts with more than 5 simultaneous connections. If you want to create a mirror of this site, just ask us and we’ll help you set it up. A mirror in China would be particularly appreciated, since most of our problematic HTTP requests seem to come from there.
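The request above comes down to "one file at a time, over one connection". For readers who would rather script this than click through a browser, the following is a rough sketch of that behaviour using only the Python standard library. The function name `download_one_at_a_time`, the `pause_seconds` parameter and the example URLs in the comments are assumptions made for illustration. They are not part of OpenSLR, and the real file locations have to be taken from the site's resource list.

```python
import os
import time
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_one_at_a_time(urls, dest_dir, fetch=urlretrieve, pause_seconds=1.0):
    """Download each URL in turn, never opening more than one connection.

    `fetch` defaults to urllib's urlretrieve and can be swapped out for testing.
    Returns the list of local paths in the order they were downloaded.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        filename = os.path.basename(urlparse(url).path)
        if not filename:
            raise ValueError(f"cannot derive a file name from {url!r}")
        target = os.path.join(dest_dir, filename)
        fetch(url, target)               # blocks until this file has finished
        saved.append(target)
        if index < len(urls) - 1:
            time.sleep(pause_seconds)    # short pause before the next request
    return saved


# Example call (placeholder URLs, replace with real links from the site):
# download_one_at_a_time(
#     ["https://example.org/resources/first.tar.gz",
#      "https://example.org/resources/second.tar.gz"],
#     "downloads",
# )
```

Because each call to `fetch` blocks until it completes, the loop stays well under the limit of 5 simultaneous connections mentioned above. Running `wget` on one URL at a time has the same effect.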

………