W!o+'s 《小伶鼬工坊演義》: Neural Networks [FFT], Part Seven

Suppose we first come to understand the 'spectrogram':

Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable. Spectrograms are sometimes called spectral waterfalls, voiceprints, or voicegrams.

Spectrograms can be used to identify spoken words phonetically, and to analyse the various calls of animals. They are used extensively in the fields of music, sonar, radar, speech processing,[1] seismology, and others.

The instrument that generates a spectrogram is called a spectrograph.

The sample outputs below show a selected block of frequencies going up the vertical axis, and time on the horizontal axis.

[Figure: typical spectrogram of the spoken words “nineteenth century”. The lower frequencies are denser because it is a male voice; the legend shows that the color intensity increases with the density.]

 

[Figure: spectrogram of a recording of a violin playing. Note the harmonics occurring at whole-number multiples of the fundamental frequency, the fourteen draws of the bow, and the visual differences between the tones.]

 

[Figure: 3D surface spectrogram of part of a piece of music.]

Format

A common format is a graph with two geometric dimensions: the horizontal axis represents time or rpm, the vertical axis is frequency; a third dimension indicating the amplitude of a particular frequency at a particular time is represented by the intensity or color of each point in the image.

There are many variations of format: sometimes the vertical and horizontal axes are switched, so time runs up and down; sometimes the amplitude is represented as the height of a 3D surface instead of color or intensity. The frequency and amplitude axes can be either linear or logarithmic, depending on what the graph is being used for. Audio would usually be represented with a logarithmic amplitude axis (probably in decibels, or dB), and frequency would be linear to emphasize harmonic relationships, or logarithmic to emphasize musical, tonal relationships.
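To make this format concrete, here is a minimal Python sketch, assuming NumPy and Matplotlib are available, that draws the spectrogram of a synthetic chirp with time running horizontally, frequency vertically, and power in dB as color; the chirp parameters and FFT settings are illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt

    fs = 8000                                # sampling rate in Hz
    t = np.arange(0, 2.0, 1.0 / fs)          # two seconds of samples
    # Linear chirp sweeping from 100 Hz up to 2000 Hz over two seconds.
    x = np.sin(2 * np.pi * (100 * t + (2000 - 100) / (2 * 2.0) * t ** 2))

    # specgram computes an FFT-based spectrogram; by default the power
    # is shown on a logarithmic (dB) color scale.
    plt.specgram(x, NFFT=256, Fs=fs, noverlap=128)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.colorbar(label="Power (dB)")
    plt.show()

The rising diagonal line in the plot is the chirp's instantaneous frequency, 100 + 950 t Hz, traced out against time.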

Generation

Spectrograms are usually created in one of two ways: approximated as a filterbank that results from a series of bandpass filters (this was the only way before the advent of modern digital signal processing), or calculated from the time signal using the FFT. These two methods actually form two different Time-Frequency Distributions, but are equivalent under some conditions.

[Figure: spectrogram and waterfall of an 8 MHz-wide PAL-I television signal.]

The bandpass filters method usually uses analog processing to divide the input signal into frequency bands; the magnitude of each filter’s output controls a transducer that records the spectrogram as an image on paper.[2]
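A digital caricature of that filterbank is easy to sketch. The following Python fragment, assuming SciPy and with purely illustrative band edges, splits a signal into bandpass channels and tracks the magnitude envelope of each channel over time, which is essentially what the pen-on-paper instrument recorded:

    import numpy as np
    from scipy.signal import butter, lfilter

    fs = 8000                                       # sampling rate in Hz
    t = np.arange(0, 1.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 440 * t)                 # test tone at 440 Hz

    # Illustrative octave-spaced band edges in Hz (an assumption, not a standard).
    bands = [(100, 200), (200, 400), (400, 800), (800, 1600)]

    rows = []
    for low, high in bands:
        # 4th-order Butterworth bandpass filter for this channel.
        b, a = butter(4, [low, high], btype="band", fs=fs)
        y = lfilter(b, a, x)
        # Crude envelope: rectify, then average over 10 ms frames.
        frame = int(0.010 * fs)
        env = np.abs(y)[: len(y) // frame * frame].reshape(-1, frame).mean(axis=1)
        rows.append(env)

    # One row per frequency band, one column per time frame.
    spec = np.vstack(rows)
    print(spec.shape)                               # (4, 100)

With the 440 Hz test tone, only the (400, 800) row carries significant energy, just as only one pen of the analog instrument would have marked the paper.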

Creating a spectrogram using the FFT is a digital process. Digitally sampled data, in the time domain, is broken up into chunks, which usually overlap, and each chunk is Fourier transformed to calculate the magnitude of its frequency spectrum. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. These spectra are then “laid side by side” to form the image or a three-dimensional surface,[3] or slightly overlapped in various ways, i.e. windowing.

The spectrogram of a signal s(t) can be estimated by computing the squared magnitude of the STFT of the signal s(t), as follows:[4]

\[
\mathrm{spectrogram}(t,\omega)=\left|\mathrm{STFT}(t,\omega)\right|^2
\]
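As a concrete rendering of this formula, here is a minimal NumPy sketch of an FFT-based spectrogram; the Hann window, the FFT length of 256 and the hop of 128 samples are illustrative assumptions, not canonical values:

    import numpy as np

    def spectrogram(s, n_fft=256, hop=128):
        """Squared-magnitude STFT of a 1-D signal s."""
        window = np.hanning(n_fft)      # tapering reduces spectral leakage
        n_chunks = 1 + (len(s) - n_fft) // hop
        cols = []
        for i in range(n_chunks):
            chunk = s[i * hop : i * hop + n_fft] * window
            # rfft keeps only the non-negative frequencies of a real signal.
            cols.append(np.abs(np.fft.rfft(chunk)) ** 2)
        # One column per chunk: frequency runs down the rows, time across columns.
        return np.array(cols).T

    fs = 8000                            # sampling rate in Hz
    t = np.arange(0, 1.0, 1.0 / fs)
    S = spectrogram(np.sin(2 * np.pi * 440 * t))
    print(S.shape)                       # (129, 61): frequency bins x time chunks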

───

 

Perhaps we can then try reading an application example:

Speech Recognition with BVLC caffe

Speech Recognition with the caffe deep learning framework

UPDATE: We are migrating to tensorflow

This project is quite fresh, and only the first of three milestones has been accomplished. Even now it might be useful if you just want to train a handful of commands/options (1,2,3..yes/no/cancel/…)

1) training spoken numbers:

  • get spectrogram training images from http://pannous.net/spoken_numbers.tar (470 MB)
  • start ./train.sh
  • test with the ipython notebook test-speech-recognition.ipynb, with caffe test ..., or with <caffe-root>/python/classify.py (see the pycaffe sketch after this list)
  • 99% accuracy, nice!
  • online recognition and learning with ./recognition-server.py and ./record.py scripts
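For the classify.py route, a hedged pycaffe sketch along the following lines should work once training has produced a model; the file names deploy.prototxt, speech_net.caffemodel, and spectrogram_zero.png are placeholders, not names taken from this repository:

    import caffe

    caffe.set_mode_cpu()

    # Placeholder file names; substitute the net definition and the weights
    # produced by ./train.sh, and any spectrogram image to classify.
    net = caffe.Classifier("deploy.prototxt", "speech_net.caffemodel",
                           image_dims=(256, 256))

    image = caffe.io.load_image("spectrogram_zero.png")
    probs = net.predict([image], oversample=False)[0]
    print("predicted class:", probs.argmax())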

[Figure: sample spectrogram, “That's what she said”.]

[Figure: sample spectrogram, Karen uttering ‘zero’ at 160 words per minute.]

2) training words:

  • 4GB of training data *
  • net topology: work in progress …
  • todo: use upcoming new caffe LSTM layers etc
  • UPDATE: LSTMs get rolling, still not merged
  • UPDATE: since the caffe project leaders have a hindering merging policy and this pull request was shifted many times without ever being merged, we are migrating to tensorflow
  • todo: add extra categories for a) silence b) common noises like typing, achoo c) ALL other noises

3) training speech:

───

 

Let us get to know what 'Caffe' is:

Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.

Check out our web image classification demo!

Why Caffe?

Expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine then deploy to commodity clusters or mobile devices.
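In the Python interface that single flag is literal; a minimal sketch:

    import caffe

    # One switch selects the backend; the model definition, the weights,
    # and the rest of the code stay exactly the same.
    use_gpu = True
    if use_gpu:
        caffe.set_device(0)    # which GPU to use
        caffe.set_mode_gpu()
    else:
        caffe.set_mode_cpu()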

Extensible code fosters active development. In Caffe's first year, it was forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors the framework tracks the state of the art in both code and models.

Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU*. That’s 1 ms/image for inference and 4 ms/image for learning. We believe that Caffe is the fastest convnet implementation available.

Community: Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Join our community of brewers on the caffe-users group and Github.

* With the ILSVRC2012-winning SuperVision model and caching IO. Consult performance details.

Documentation

───