9va-pi ︰ Speech Synthesis

What is "speech synthesis"? Wikipedia says:


 

Speech Synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s.

A text-to-speech system (or “engine”) is composed of two parts:[3] a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations),[4] which is then imposed on the output speech.
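To make the front-end's two tasks concrete, here is a minimal sketch (a toy with a hypothetical five-word lexicon, not any real engine's code): it normalizes an abbreviation and a digit into written-out words, looks up a phoneme sequence for each word, and returns them with a simple prosodic marker as the symbolic linguistic representation that a back-end would turn into sound.

```python
# Toy TTS front-end sketch: text normalization + grapheme-to-phoneme lookup.
# The lexicon, abbreviation and number tables below are hypothetical.

LEXICON = {                      # word -> phoneme sequence (ARPAbet-like)
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith":  ["S", "M", "IH", "TH"],
    "has":    ["HH", "AE", "Z"],
    "two":    ["T", "UW"],
    "cats":   ["K", "AE", "T", "S"],
}
ABBREVIATIONS = {"dr.": "doctor"}    # text normalization rules
NUMBERS = {"2": "two"}

def normalize(text):
    """Expand abbreviations and digits into written-out words."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, NUMBERS.get(token, token))
        words.append(token.strip(".,;:"))
    return words

def front_end(text):
    """Return the symbolic linguistic representation for one sentence."""
    phonemes = [LEXICON.get(w, ["?"]) for w in normalize(text)]   # "?" = unknown word
    return {"phonemes": phonemes, "prosodic_unit": "sentence"}

print(front_end("Dr. Smith has 2 cats."))
```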

 

[Figure: block diagram of a typical text-to-speech system (TTS_System.svg, Wikipedia)]
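As a toy illustration of the concatenative approach described above, the sketch below (hypothetical data) treats each stored unit as a plain array of samples and "synthesizes" a limited-domain utterance by looking up whole-word units and concatenating them; real systems store phones or diphones and smooth the joins, which this toy does not attempt.

```python
import numpy as np

FS = 16000  # sample rate of the stored units, in Hz

def fake_unit(freq, dur=0.25):
    """Stand-in for a recorded word: a short sine burst (illustrative only)."""
    t = np.arange(int(FS * dur)) / FS
    return 0.3 * np.sin(2 * np.pi * freq * t)

# hypothetical unit database: word -> recorded waveform
UNITS = {"flight": fake_unit(220), "two": fake_unit(330), "ten": fake_unit(440)}

def synthesize(words):
    """Concatenate stored units with a short silence between words."""
    gap = np.zeros(int(FS * 0.05))
    pieces = []
    for w in words:
        pieces += [UNITS[w], gap]
    return np.concatenate(pieces)

audio = synthesize(["flight", "two", "ten"])
print(len(audio) / FS, "seconds of audio")
```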

 

I recall that my earliest encounter with it was in the "screen reader" software of a "computer system for the blind". I vaguely remember that the voice back then had a computer-like, R2-D2 quality, much as if the "phonetic transcription" had been poorly tuned in

eSpeak

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn RISC OS computers which was originally written in 1995 by Jonathan Duddington.

A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages.

Because of infrequent updates over the last few years, several eSpeak forks have emerged on GitHub.[3] After discussions on eSpeak's mailing list,[4][5] the espeak-ng fork, maintained by Reece Dunn, was chosen as the new canonical home for further eSpeak development.

Because of its small size and support for many languages, it is included as the default speech synthesizer in the NVDA open-source screen reader for Windows, and on Ubuntu and other Linux installation discs.

The quality of the language voices varies greatly. Some have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.
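To hear it for yourself, a small usage sketch is enough (assuming the espeak package is installed, for example from the Raspbian/Debian repositories); it simply drives the espeak command line from Python, once through the sound card and once into a WAV file.

```python
import subprocess

text = "Hello from the Raspberry Pi."

# speak through the sound card: -v selects the voice, -s the speed in words per minute
subprocess.run(["espeak", "-v", "en", "-s", "150", text], check=True)

# render the same text to a WAV file instead of playing it
subprocess.run(["espeak", "-v", "en", "-w", "hello.wav", text], check=True)
```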

 

Word has it that Carnegie Mellon University's "Festival"

Festival

Welcome to festvox.org
This project is part of the work at Carnegie Mellon University’s speech group aimed at advancing the state of Speech Synthesis.

  • 14th February 2015: Indic voice release (Hindi, Marathi, Tamil and Telugu) and on-line demos
  • 25th December 2014: A suite of new releases:

    There is a general script that shows what you need to download, compile and run to use these new versions.

The Festvox project aims to make the building of new synthetic voices more systemic and better documented, making it possible for anyone to build a new voice. Specifically we offer:

  • Documentation, including scripts explaining the background and specifics for building new voices for speech synthesis in new and supported languages.
  • Specific scripts to build new voices in supported languages, such as US and UK English.
  • Aids to building synthetic voices for limited domains
  • Example speech databases to help building new voices.
  • Links, demos and a repository for new voices

The documentation, tools and dependent software are all free without restriction (commercial or otherwise). Licensing of voices built with these techniques is the responsibility of the builders. This work is firmly grounded in Edinburgh University's Festival Speech Synthesis System and Carnegie Mellon University's small-footprint Flite synthesis engine.

This work has been supported by various groups including Carnegie Mellon University, the US National Science Foundation (NSF), and the US Defense Advanced Research Projects Agency (DARPA).

Requirements for building a voice
Note that the techniques and processes described here do not guarantee that you'll end up with a high-quality, acceptable voice, but with a little care you can likely build a new synthesis voice in a supported language in a few days, or in a new language in a few weeks (more or less, depending on the complexity of the language and the desired quality). You will need:

  • To read the documentation
  • A Unix machine (e.g. Linux, FreeBSD, Solaris, etc.) with working audio I/O. This may work on other platforms, but many scripts, perhaps unnecessarily, depend on Unix utilities like awk, sed, etc.
  • Installed versions of Edinburgh University’s Festival Speech Synthesis System and Edinburgh Speech Tools (distributed with Festival).
  • A waveform viewing/labeling program like emulabel distributed as part of Macquarie University’s EMU speech database system. Although automatic labeling software is included in festvox, a display tool is necessary for diagnosis and debugging.
  • Patience and care, and a little interest in the subject of speech technology.

 

speech synthesis software has become quite human-sounding by now. Both of these packages are available in raspbian jessie; interested readers can install them and try them out for themselves.
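For example, one quick way to try Festival from Python (assuming the festival package and a default English voice are installed) is to write a sentence to a temporary file, speak it with festival --tts, and render it to disk with the bundled text2wave script.

```python
import subprocess
import tempfile

# write a sentence to a temporary text file for Festival to read
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Festival can read this sentence aloud.")
    path = f.name

# speak the file through the sound card
subprocess.run(["festival", "--tts", path], check=True)

# convert the same file to a waveform on disk using text2wave
subprocess.run(["text2wave", path, "-o", "festival.wav"], check=True)
```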

Following M♪o's footsteps all along, the author finally comes to explore "articulation", in all its senses of joints, joining, and clear enunciation, in

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be controlled in a number of ways, usually by modifying the position of the speech articulators, such as the tongue, jaw, and lips. Speech is created by digitally simulating the flow of air through the representation of the vocal tract.
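A faithful articulatory model (tube sections whose areas follow the articulators, with airflow simulated through them) is more than a short example can carry, but the much-simplified source-filter toy below hints at the underlying idea: a glottal-like pulse train is passed through resonators that stand in for the vocal tract. The formant frequencies and bandwidths are rough illustrative values for the vowel /a/, not derived from any articulatory data.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000                       # sample rate in Hz
DUR, F0 = 0.6, 120               # 0.6 s of voicing at a 120 Hz pitch
n = int(FS * DUR)

# crude glottal source: an impulse train at the fundamental frequency
source = np.zeros(n)
source[::FS // F0] = 1.0

def resonator(signal, freq, bw):
    """Second-order IIR resonance, one formant of the 'vocal tract'."""
    r = np.exp(-np.pi * bw / FS)
    a = [1.0, -2.0 * r * np.cos(2 * np.pi * freq / FS), r * r]
    return lfilter([1.0 - r], a, signal)

# rough formant frequencies/bandwidths for /a/ (illustrative values only)
speech = source
for freq, bw in [(730, 90), (1090, 110), (2440, 160)]:
    speech = resonator(speech, freq, bw)

speech /= np.abs(speech).max()   # normalize before writing to a WAV file
print(f"{len(speech) / FS:.1f} s of synthetic vowel-like sound")
```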

 

 

And therein, perhaps, lies the hidden mechanism of a "phono-semantic character chart" (形聲字譜) synthesis system!!??