Once we understand the “spectrogram”:
Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable. Spectrograms are sometimes called spectral waterfalls, voiceprints, or voicegrams.
Spectrograms can be used to identify spoken words phonetically, and to analyse the various calls of animals. They are used extensively in the fields of music, sonar, radar, speech processing,[1] seismology, and others.
The instrument that generates a spectrogram is called a spectrograph.
The sample outputs on the right show a select block of frequencies going up the vertical axis, and time on the horizontal axis.
[Figure: Typical spectrogram of the spoken words “nineteenth century”. The lower frequencies are denser because it is a male voice. The legend to the right shows that the colour intensity increases with the density.]
[Figure: 3D surface spectrogram of part of a music piece.]
Format
A common format is a graph with two geometric dimensions: the horizontal axis represents time or rpm, the vertical axis is frequency; a third dimension indicating the amplitude of a particular frequency at a particular time is represented by the intensity or color of each point in the image.
There are many variations of format: sometimes the vertical and horizontal axes are switched, so time runs up and down; sometimes the amplitude is represented as the height of a 3D surface instead of color or intensity. The frequency and amplitude axes can be either linear or logarithmic, depending on what the graph is being used for. Audio would usually be represented with a logarithmic amplitude axis (probably in decibels, or dB), and frequency would be linear to emphasize harmonic relationships, or logarithmic to emphasize musical, tonal relationships.
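The logarithmic amplitude axis mentioned above is what the decibel scale provides: each factor of 10 in magnitude is a 20 dB step, which compresses the huge dynamic range of audio into something displayable. A minimal sketch of the conversion (the magnitude values are made up for illustration):

```python
import numpy as np

# Hypothetical linear spectral magnitudes.
mags = np.array([1.0, 0.5, 0.1, 0.01])

# Convert to decibels; the floor guards against log(0) for silent bins.
db = 20 * np.log10(np.maximum(mags, 1e-12))
print(db)  # ≈ [0, -6.02, -20, -40] dB
```

A 1000:1 spread in linear magnitude thus becomes a tidy 0 to -60 dB range on the colour scale.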
Generation
Spectrograms are usually created in one of two ways: approximated as a filterbank that results from a series of bandpass filters (this was the only way before the advent of modern digital signal processing), or calculated from the time signal using the FFT. These two methods actually form two different Time-Frequency Distributions, but are equivalent under some conditions.
The bandpass filters method usually uses analog processing to divide the input signal into frequency bands; the magnitude of each filter’s output controls a transducer that records the spectrogram as an image on paper.[2]
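The filterbank idea can be sketched digitally: split the signal into frames and, for each frame, sum the spectral energy falling inside each band. This is only an illustrative approximation of the analog method (the band edges, frame length, and test tone are arbitrary choices):

```python
import numpy as np

def filterbank_spectrogram(x, fs, bands, frame_len):
    """Approximate a bandpass-filterbank spectrogram: per frame,
    the energy inside each (lo, hi) frequency band in Hz."""
    n_frames = len(x) // frame_len
    out = np.zeros((len(bands), n_frames))
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    for j in range(n_frames):
        frame = x[j * frame_len:(j + 1) * frame_len]
        spec = np.abs(np.fft.rfft(frame))
        for i, (lo, hi) in enumerate(bands):
            out[i, j] = spec[(freqs >= lo) & (freqs < hi)].sum()
    return out

# A 440 Hz tone should light up the 400-500 Hz band.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
bands = [(0, 400), (400, 500), (500, 4000)]
S = filterbank_spectrogram(x, fs, bands, 256)
```

Each row of S plays the role of one bandpass filter's output over time; a real analog filterbank would have far more, narrower bands.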
Creating a spectrogram using the FFT is a digital process. Digitally sampled data, in the time domain, is broken up into chunks, which usually overlap, and each chunk is Fourier transformed to calculate the magnitude of its frequency spectrum. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. The spectra are then “laid side by side” to form the image or a three-dimensional surface,[3] or slightly overlapped in various ways (windowing).
The spectrogram of a signal s(t) can be estimated by computing the squared magnitude of the STFT of the signal s(t), as follows:[4]

spectrogram(t, ω) = |STFT(t, ω)|²
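The chunk-and-transform procedure, with a Hann window and the squared STFT magnitude, can be sketched in a few lines (assuming NumPy; the frame size and hop length are arbitrary choices):

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Squared magnitude of the STFT: Hann-windowed, overlapping
    frames of x, one FFT per frame. Returns a (freq, time) array."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    return np.array(frames).T

# A pure 1000 Hz tone should peak at the 1000 Hz frequency bin.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
S = spectrogram(x)
peak_bin = S[:, 0].argmax()
print(peak_bin * fs / 256)  # ≈ 1000 Hz
```

Each column of S is one vertical line of the image; plotting 10·log10(S) gives the familiar dB-scaled picture.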
───
Perhaps we can try reading an application example:
Speech Recognition with BVLC caffe
Speech Recognition with the caffe deep learning framework
UPDATE: We are migrating to tensorflow
This project is quite fresh, and only the first of three milestones is accomplished. Even now it might be useful if you just want to train a handful of commands/options (1, 2, 3 … yes/no/cancel/…)
1) training spoken numbers:
- get spectrogram training images from http://pannous.net/spoken_numbers.tar (470 MB)
- start ./train.sh
- test with ipython notebook test-speech-recognition.ipynb, or caffe test ..., or <caffe-root>/python/classify.py
- 99% accuracy, nice!
- online recognition and learning with the ./recognition-server.py and ./record.py scripts
Sample spectrogram: Karen uttering ‘zero’ at 160 words per minute.
2) training words:
- 4GB of training data *
- net topology: work in progress …
- todo: use upcoming new caffe LSTM layers etc
- UPDATE LSTMs get rolling, still not merged
- UPDATE: since the caffe project leaders have a merging policy that hinders progress, and this pull request was postponed many times without ever being merged, we are migrating to tensorflow
- todo: add extra categories for a) silence b) common noises like typing, achoo c) ALL other noises
3) training speech:
- todo!
- 100GB of training data here: http://www.openslr.org/12/
- TIMIT dataset: $27,000.00 membership fee, or $250 for non-members + $2,400 under a research-only license?
- combine with google n-grams
───
Let us get to know what “Caffe” is:
Caffe
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.
Check out our web image classification demo!
Why Caffe?
Expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine then deploy to commodity clusters or mobile devices.
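The “single flag” for switching between CPU and GPU lives in the solver configuration. A minimal sketch of a solver.prototxt fragment (the net path and hyperparameter values here are placeholders, not from the source):

```
net: "train_val.prototxt"   # placeholder path to the model definition
base_lr: 0.01               # placeholder learning rate
max_iter: 10000             # placeholder iteration count
solver_mode: GPU            # flip to CPU to train without a GPU
```

Because the model and solver are plain configuration files, the same definitions run unchanged on a GPU training box and a CPU-only deployment target.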
Extensible code fosters active development. In Caffe’s first year, it has been forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors the framework tracks the state-of-the-art in both code and models.
Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference and 4 ms/image for learning. We believe that Caffe is the fastest convnet implementation available.
Community: Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Join our community of brewers on the caffe-users group and Github.
* With the ILSVRC2012-winning SuperVision model and caching IO. Consult performance details.
Documentation
- DIY Deep Learning for Vision with Caffe: tutorial presentation.
- Tutorial Documentation: practical guide and framework reference.
- arXiv / ACM MM ’14 paper: a 4-page report for the ACM Multimedia Open Source competition (arXiv:1408.5093v1).
- Installation instructions: tested on Ubuntu, Red Hat, OS X.
- Model Zoo: BVLC suggests a standard distribution format for Caffe models, and provides trained models.
- Developing & Contributing: guidelines for development and contributing to Caffe.
- API Documentation: developer documentation automagically generated from code comments.
───