Category Archives: Raspberry Pi Education

9va-pi ︰ Year's End and the New Year

Word has it that this year's Oxford 'Word of the Year' is

 

Face with Tears of Joy

 

, an emoji of weeping for joy. Is it really a 'word'? Perhaps any symbol that conveys feeling and meaning may count as writing; and since the whole world can recognize and use it, why shouldn't it be called a 'word'! Rice carving, too, once had its moment of glory,

Rice carving is the craft of writing and painting on grains of rice and mounting them as ornaments. It flourished after 2000 and was founded by a folk artist in Harbin (street performers used to call it 'carving characters on rice'; when it actually originated is unknown). Folk legends exist, but there is no official historical record. One legend says that in the reign of Emperor Huizong of the Song dynasty, a poor scholar travelled to the capital for the imperial examination, failed, and, having run out of travel money, hungry and thirsty, had a sudden idea: right there in the street he wrote people's names and the character for good fortune on grains of glutinous rice. Unexpectedly, many people wanted his writing and he earned a tidy sum. Within a year he had become a grain merchant and returned home in glory.

 

A Universe in a Grain of Rice

 

, so is this 'universe within a grain of rice' art?! It is just like the way some people say 'year's end' while others speak of 'crossing into the new year': could 'year' and 'new year' really have become two different things, so that the two ways of speaking no longer meet!?

So a 'form-and-sound character score' (形聲字譜) can give form to the shapes of the wide world, echo the sounds of everything in nature, and score the feelings of all beings in the universe. Used for 'communicating with one another', what more could one ask?

9va-pi ︰ Gnu Speech: Compiling and Installing

To understand what 'gnuspeech' is, it is best to hear what the official website says:

What is gnuspeech?

gnuspeech makes it easy to produce high quality computer speech output, design new language databases, and create controlled speech stimuli for psychophysical experiments. gnuspeechsa is a cross-platform module of gnuspeech that allows command line, or application-based speech output. The software has been released as two tarballs that are available in the project Downloads area of http://savannah.gnu.org/projects/gnuspeech. Those wishing to contribute to the project will find the OS X (gnuspeech) and CMAKE (gnuspeechsa) sources in the Git repository on that same page. The gnuspeech suite still lacks some of the database editing components (see the Overview diagram below) but is otherwise complete and working, allowing articulatory speech synthesis of English, with control of intonation and tempo, and the ability to view the parameter tracks and intonation contours generated. The intonation contours may be edited in various ways, as described in the Monet manual. Monet provides interactive access to the synthesis controls. TRAcT provides interactive access to the underlying tube resonance model that converts the parameters into sound by emulating the human vocal tract.

The suite of programs uses a true articulatory model of the vocal tract and incorporates models of English rhythm and intonation based on extensive research that sets a new standard for synthetic speech.

The original NeXT computer implementation is complete, and is available from the NeXT branch of the SVN repository linked above. The port to GNU/Linux under GNUStep, also in the SVN repository under the appropriate branch, provides English text-to-speech capability, but parts of the database creation tools are still in the process of being ported.

Credits for research and implementation of the gnuspeech system appear in the section Thanks to those who have helped below. Some of the features of gnuspeech, with the tools that are part of the software suite, include:

  • A Tube Resonance Model (TRM) for the human vocal tract (also known as a transmission-line analog, or a waveguide model) that truly represents the physical properties of the tract, including the energy balance between the nasal and oral cavities as well as the radiation impedance at lips and nose.
  • A TRM Control Model, based on formant sensitivity analysis, that provides a simple, but accurate method of low-level articulatory control. By using the Distinctive Region Model (DRM) only eight slowly varying tube section radii need be specified. The glottal (vocal fold) waveform and various suitably “coloured” random noise signals may be injected at appropriate places to provide voicing, aspiration, frication and noise bursts.
  • Databases which specify: the characteristics of the articulatory postures (which loosely correspond to phonemes); rules for combinations of postures; and information about voicing, frication and aspiration. These are the data required to produce specific spoken languages from an augmented phonetic input. Currently, only the database for the English language exists, though French vowel postures are also included.
  • A text-to-augmented-phonetics conversion module (the Parser) to convert arbitrary text, preferably incorporating normal punctuation, into the form required for applying the synthesis methods.
  • Models of English rhythm and intonation based on extensive research that are automatically applied.
  • “Monet”—a database creation and editing system, with a carefully designed graphical user interface (GUI) that allows the databases containing the necessary phonetic data and dynamic rules to be set up and modified in order that the computer can “speak” arbitrary languages.
  • A 70,000+ word English Pronouncing Dictionary with rules for derivatives such as plurals, and adverbs, and including 6000 given names. The dictionary also provides part-of-speech information to facilitate later addition of grammatical parsing that can further improve the excellent pronunciation, rhythm and intonation.
  • Sub-dictionaries that allow different user- or application-specific pronunciations to be substituted for the default pronunciations coming from the main dictionary (not yet ported).
  • Letter-to-sound rules to deal with words that are not in the dictionaries.
  • A parser to organise the input and deal with dates, numbers, abbreviations, etc.
  • Tools for managing the dictionary and carrying out analysis of speech.
  • “Synthesizer”—a GUI-based application to allow experimentation with a stand-alone TRM. All parameters, both static and dynamic, may be varied and the output can be monitored and analysed. It is an important component in the research needed to create the databases for target languages.

[Figure tts-block-diagram: Overview of the main Articulatory Speech Synthesis System]

 

Why is it called gnuspeech?

It is a play on words. This is a new (g-nu) “event-based” approach to speech synthesis from text, that uses an accurate articulatory model rather than a formant-based approximation. It is also a GNU project, aimed at providing high quality text-to-speech output for GNU/Linux, Mac OS X, and other platforms. In addition, it provides comprehensive tools for psychophysical and linguistic experiments as well as for creating the databases for arbitrary languages.

What is the goal of the gnuspeech project?

The goal of the project is to create the best speech synthesis software on the planet.

 

Since the author has no Mac OS X environment, the installation below follows only the INSTALL file inside gnuspeechsa-0.1.5.tar.gz, verified on the Raspberry Pi as follows:

 

mkdir gnuspeech
cd gnuspeech/

# fetch the software
wget http://ftp.gnu.org/gnu/gnuspeech/gnuspeechsa-0.1.5.tar.gz
tar -zxvf gnuspeechsa-0.1.5.tar.gz 

# build and install (requires cmake and a C++ compiler,
# e.g. sudo apt-get install cmake build-essential if they are missing)
cd gnuspeechsa-0.1.5/
pkg_dir=$PWD
mkdir ../GnuspeechSA-build
cd ../GnuspeechSA-build
cmake -D CMAKE_BUILD_TYPE=Release $pkg_dir
make
sudo make install
sudo ldconfig

# test
./gnuspeech_sa -c $pkg_dir/data/en -p /tmp/test_param.txt -o /tmp/test.wav "Hello world." && aplay -q /tmp/test.wav
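If the test sentence plays back correctly, the same three options can be wrapped in a small helper for casual experiments. The sketch below merely reuses the -c / -p / -o options from the test above; the name say and the temporary file paths are arbitrary choices, not anything defined by gnuspeechsa.

# a minimal wrapper around the test command above
say() {
  ./gnuspeech_sa -c "$pkg_dir/data/en" -p /tmp/say_param.txt -o /tmp/say.wav "$*" \
    && aplay -q /tmp/say.wav
}

say "The quick brown fox jumps over the lazy dog."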

 

For an introduction to the program, the reader may refer to its

README

GnuspeechSA (Stand-Alone)
==========================

GnuspeechSA is a port to C++/C of the TTS_Server in the original Gnuspeech (http://www.gnu.org/software/gnuspeech/) source code written for NeXTSTEP.
It is a command-line program that converts text to speech.

This project is based on code from Gnuspeech SVN, rev. 672, downloaded in 2014-08-02. The source code was obtained from the directories:

nextstep/trunk/ObjectiveC/Monet.realtime
nextstep/trunk/src/SpeechObject/postMonet/server.monet

This software is part of Gnuspeech.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the COPYING file for more details.

 

Although building gnuspeech-0.9.tar.gz requires a Mac OS X environment, it contains several important documents that are worth reading, so readers are advised to fetch it as well:

wget http://ftp.gnu.org/gnu/gnuspeech/gnuspeech-0.9.tar.gz
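Before unpacking, the archive can be listed to see where the manuals sit; the grep pattern below is only a guess at likely file names, since the exact layout of the 0.9 tarball may differ.

# look for the Monet / TRAcT manuals inside the tarball
tar -ztf gnuspeech-0.9.tar.gz | grep -i -E 'manual|\.pdf'

# then unpack for reading
tar -zxvf gnuspeech-0.9.tar.gz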

As for how to play with it, readers are left to their own devices. The author is a newcomer too; perhaps it can be discussed another day.

9va-pi ︰ Gnu Speech: First Official Release

If the chirping of insects and the song of birds are inborn instincts, then human speech is likewise a gift of nature. A 'physical model' of how things produce sound is, however, very hard to build. Hence

Gnuspeech

Gnuspeech is an extensible text-to-speech computer software package that produces artificial speech output based on real-time articulatory speech synthesis by rules. That is, it converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory speech synthesizer; uses these to drive an articulatory model of the human vocal tract producing an output suitable for the normal sound output devices used by various computer operating systems; and does this at the same or faster rate than the speech is spoken for adult speech.

Design

The synthesizer is a tube resonance, or waveguide, model that models the behavior of the real vocal tract directly, and reasonably accurately, unlike formant synthesizers that indirectly model the speech spectrum.[1] The control problem is solved by using René Carré’s Distinctive Region Model[2] which relates changes in the radii of eight longitudinal divisions of the vocal tract to corresponding changes in the three frequency formants in the speech spectrum that convey much of the information of speech. The regions are, in turn, based on work by the Stockholm Speech Technology Laboratory[3] of the Royal Institute of Technology (KTH) on “formant sensitivity analysis” – that is, how formant frequencies are affected by small changes in the radius of the vocal tract at various places along its length.[4]

 

may well point to one future of 'sound synthesis'. In it, the 'vocal tract'

Vocal tract

The vocal tract is the cavity in human beings and in animals where sound that is produced at the sound source (larynx in mammals; syrinx in birds) is filtered.

In birds it consists of the trachea, the syrinx, the oral cavity, the upper part of the esophagus, and the beak. In mammals it consists of the laryngeal cavity, the pharynx, the oral cavity, and the nasal cavity.

The estimated average length of the vocal tract in adult male humans is 16.9 cm and 14.1 cm in adult females.[1]

[Figure Sagittalmouth: Sagittal section of human vocal tract]

 

model is the foundation on which the vocal character of all things is shaped. It is delightful to read

 

Initial release of gnuspeech available

From: David Hill <drh-AT-firethorne.com>
To: Gnu Announce <info-gnu-AT-gnu.org>
Subject: First release of gnuspeech project software
Date: Mon, 19 Oct 2015 18:41:22 -0700
Message-ID: <AD48546B-E89C-4F7C-A2C5-D45D5C3C46A3@firethorne.com>
Archive-link: Article, Thread

gnuspeech-0.9 and gnuspeechsa-0.1.5 first official release

Gnuspeech is a new approach to synthetic speech as well as a speech research tool. It comprises a true articulatory model of the vocal tract, databases and rules for parameter composition, a 70,000 word plus pronouncing dictionary, a letter-to-sound fall-back module, and models of English rhythm and intonation, all based on extensive research that sets a new standard for synthetic speech, and computer-based speech research.

There are two main components in this first official release. For those who would simply like speech output from whatever system they are using, including incorporating speech output in their applications, there is the gnuspeechsa tarball (currently 0.1.5), a cross-platform speech synthesis application, compiled using CMake.

For those interested in an interactive system that gives access to the underlying algorithms and databases involved, providing an understanding of the mechanisms, databases, and output forms involved, as well as a tool for experiment and new language creation, there is the gnuspeech tarball (currently 0.9) that embodies several sub-apps, including the interactive database creation system Monet (My Own Nifty Editing Tool), and TRAcT (the Tube Resonance Access Tool) — a GUI interface to the tube resonance model used in gnuspeech, that emulates the human vocal tract and provides the basis for an accurate rendition of human speech.

This second tarball includes full manuals on both Monet and TRAcT. The Monet manual covers the compilation and installation of gnuspeechsa on a Macintosh under OS X 10.10.x, and references the related free software that allows the speech to be incorporated in applications. Appendix D of the Monet manual provides some additional information about gnuspeechsa and associated software that is available, and details how to compile it using CMake on the Macintosh under 10.10.x (Yosemite).

The digitally signed tarballs may be accessed at

http://ftp.gnu.org/gnu/gnuspeech/
There is a list of mirrors at http://www.gnu.org/order/ftp.html and the site http://ftpmirror.gnu.org/gnuspeech will redirect to a nearby mirror

A longer project description and credits may be found at: http://www.gnu.org/software/gnuspeech/
which is also linked to a brief (four page) project history/component description, and a paper on the Tube Resonance Model by Leonard Manzara.
Signed: David R Hill
———————–
drh@firethorne.com

http://www.gnu.org/software/gnuspeech/

http://savannah.gnu.org/projects/gnuspeech

https://savannah.gnu.org/users/davidhill

 

, though for the moment one probably has to learn how to compile and install it first.

9va-pi ︰ Speech Synthesis

What is 'Speech Synthesis'? Wikipedia says:


Speech Synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s.

A text-to-speech system (or “engine”) is composed of two parts:[3] a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations),[4] which is then imposed on the output speech.

 

[Figure: TTS_System.svg]

 

The author's earliest encounter with it was in the 'screen reader' software of a 'computer system for the blind'. As I vaguely recall, the voice back then sounded as robotic as R2-D2, much like what you get from a poorly tuned 'phonetic transcription' in

eSpeak

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn RISC OS computers which was originally written in 1995 by Jonathan Duddington.

A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages.

Because of infrequent updates in the last few years, several espeak forks emerged on GitHub.[3] After discussions on espeak's discussion list,[4][5] the espeak-ng fork managed by Reece Dunn was chosen as the new canonical home for further espeak development.

Because of its small size and many languages, it is included as the default speech synthesizer in the NVDA open source screen reader for Windows, and on the Ubuntu and other Linux installation discs.

The quality of the language voices varies greatly. Some have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.
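On the Raspberry Pi, eSpeak itself is a convenient way to see the front-end/back-end split quoted earlier: as the author understands the Raspbian espeak package, -x prints the phoneme string produced by the front end, -q suppresses the audio, and -w writes the back end's output to a wav file. A minimal sketch:

sudo apt-get install espeak

# front end only: print the phoneme transcription without speaking
espeak -q -x "Hello world."

# full synthesis: write a wav file and play it back
espeak -w /tmp/hello.wav "Hello world." && aplay -q /tmp/hello.wav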

 

It is said that Carnegie Mellon University's 'grand banquet',

Festival

Welcome to festvox.org
This project is part of the work at Carnegie Mellon University’s speech group aimed at advancing the state of Speech Synthesis.

  • 14th February 2015: Indic voice release (Hindi, Marathi, Tamil and Telugu) and on-line demos
  • 25th December 2014: A suite of new releases:

    There is a general script that shows what you need to download, compile and run to use these new versions.

The Festvox project aims to make the building of new synthetic voices more systemic and better documented, making it possible for anyone to build a new voice. Specifically we offer:

  • Documentation, including scripts explaining the background and specifics for building new voices for speech synthesis in new and supported languages.
  • Specific scripts to build new voices in supported languages, such as US and UK English.
  • Aids to building synthetic voices for limited domains
  • Example speech databases to help building new voices.
  • Links, demos and a repository for new voices

The documentation, tools and dependent software are all free without restriction (commercial or otherwise). Licensing of voices built by these techniques is the responsibility of the builders. This work is firmly grounded within Edinburgh University's Festival Speech Synthesis System and Carnegie Mellon University's small footprint Flite synthesis engine.

This work has been supported by various groups including Carnegie Mellon University, the US National Science Foundation (NSF), and the US Defense Advanced Research Projects Agency (DARPA).

Requirements for building a voice
Note that the techniques and processes described here do not guarantee that you'll end up with a high quality acceptable voice, but with a little care you can likely build a new synthesis voice in a supported language in a few days, or in a new language in a few weeks (more or less depending on the complexity of the language, and the desired quality). You will need:

  • To read the documentation
  • A Unix machine (e.g. Linux, FreeBSD, Solaris, etc.) with working audio i/o. This may work on other platforms but many scripts, perhaps unnecessarily, depend on Unix utilities like awk, sed, etc.
  • Installed versions of Edinburgh University’s Festival Speech Synthesis System and Edinburgh Speech Tools (distributed with Festival).
  • A waveform viewing/labeling program like emulabel distributed as part of Macquarie University’s EMU speech database system. Although automatic labeling software is included in festvox, a display tool is necessary for diagnosis and debugging.
  • Patience and care, and a little interest in the subject of speech technology.

 

speech-synthesis software already sounds rather personable. Both packages are available in raspbian jessie; interested readers can install them and experiment on their own.
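As a minimal sketch of trying Festival as well (espeak was installed above; festival's --tts mode reads a text file, or standard input when no file is given, and mytext.txt below is just a placeholder name):

sudo apt-get install festival

# speak a sentence piped in on standard input
echo "Hello world." | festival --tts

# or speak the contents of a whole text file
festival --tts mytext.txt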

Having followed M♪o's footsteps all the way, the author at last arrives at an exploration of 'articulation' (joints, joining, clear enunciation) and

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be controlled in a number of ways which usually involves modifying the position of the speech articulators, such as the tongue, jaw, and lips. Speech is created by digitally simulating the flow of air through the representation of the vocal tract.

 

 

the hidden workings of a 'form-and-sound character score' (形聲字譜) synthesis system!!??