【鼎革‧革鼎】: Raspbian Stretch 《Part Six, K.3 - Speech Interface - 1》

Humans communicate by speech as a matter of course, yet making speech serve as a 'human-machine interface' is anything but easy! I remember that my earliest tie to this field came from taking part, as an effort to help low-vision users reach the Internet, in

Screen reader

A screen reader (also called screen-reading software) is an application installed on a computer that converts text, graphics, and other parts of the computer interface into speech and/or braille, by means of text-to-speech (TTS) technology. It is of great help to visually impaired or reading-disabled users, and some people use it together with screen-magnification software. A screen reader can read out at least: ……

When screen-reading software is in use, whether the display is switched on makes no difference to its operation; the screen reader itself is not an essential component of a computer, but simply a piece of software or an output device.

When choosing screen-reading software, users usually weigh many factors, including the platform in use, cost, and personal preference, and the choice is also shaped by the organization they belong to (such as a charity, a school, or an employer); which screen reader to pick is therefore often a contentious matter.

Since Windows 2000, Microsoft has shipped a light-duty screen reader named Microsoft Narrator with its operating systems, while Apple includes the full-featured screen reader VoiceOver in its Macintosh operating system. Oralux Linux, for its part, bundles three screen readers: Emacspeak, Yasr, and Speakup. The open-source GNOME desktop environment has included two screen readers, Gnopernicus and Orca. Beyond these there are many other open-source screen readers, such as the Linux Screen Reader for the GNOME platform and NonVisual Desktop Access (NVDA) for Windows.

 

research. At the time I simply felt that 'computer speech synthesis' really did carry a 'machine flavor'!! Even a piano performance relies on the player's touch, now light, now heavy, now quick, now slow, to acquire its charm; how could a beeping 'synthesizer' hope to imitate that? I wonder whether the 'ugly duckling' of those days has since grown into a 'swan'??

On reflection, perhaps one ought to get an early start on mastering

PHYSICAL AUDIO SIGNAL PROCESSING FOR VIRTUAL MUSICAL INSTRUMENTS AND AUDIO EFFECTS

JULIUS O. SMITH III
Center for Computer Research in Music and Acoustics (CCRMA)

 

wouldn't you say ☆

Voice Synthesis

Unquestionably, the most extensive prior work in the 20th century relevant to virtual acoustic musical instruments occurred within the field of speech synthesis [140,143,366,411,338,106,245]. This research was driven by both academic interest and the potential practical benefits of speech compression to conserve telephone bandwidth. It was clear at an early point that the bandwidth of a telephone channel (nominally 200-3200 Hz) was far greater than the “information rate” of speech. It was reasoned, therefore, that instead of encoding the speech waveform, it should be possible to encode instead more slowly varying parameters of a good synthesis model for speech.

Before the 20th century, there were several efforts to simulate the voice mechanically, going back at least as far as 1779 [141].

……

Vocal Tract Analog Models

There is one speech-synthesis thread that clearly classifies under computational physical modeling, and that is the topic of vocal tract analog models. In these models, the vocal tract is regarded as a piecewise cylindrical acoustic tube. The first mechanical analogue of an acoustic-tube model appears to be a hand-manipulated leather tube built by Wolfgang von Kempelen in 1791, reproduced with improvements by Sir Charles Wheatstone [141]. In electrical vocal-tract analog models, the piecewise cylindrical acoustic tube is modeled as a cascade of electrical transmission line segments, with each cylindrical segment being modeled as a transmission line at some fixed characteristic impedance. An early model employing four cylindrical sections was developed by Hugh K. Dunn in the late 1940s [120]. An even earlier model based on two cylinders joined by a conical section was published by T. Chiba and M. Kajiyama in 1941 [120]. Cylinder cross-sectional areas A_i were determined from X-ray images of the vocal tract, and the corresponding characteristic impedances were proportional to 1/A_i. An impedance-based, lumped-parameter approximation to the transmission-line sections was used so that analog LC ladders could implement the model electronically. By the 1950s, LC vocal-tract analog models included a side-branch for nasal simulation [132].
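To make the piecewise-cylindrical picture concrete, here is a minimal numerical sketch, not taken from the sources above, of how characteristic impedances proportional to 1/A_i give rise to reflection coefficients at the junctions between sections; the section areas are invented for illustration.

```python
# Minimal sketch: characteristic impedances and junction reflection
# coefficients for a piecewise-cylindrical acoustic-tube model.
# The section areas below are illustrative values, not measured data.

RHO = 1.2      # air density, kg/m^3 (approximate)
C   = 343.0    # speed of sound in air, m/s (approximate)

# Cross-sectional areas A_i of the cylindrical sections, in m^2 (made up).
areas = [3.0e-4, 5.0e-4, 2.0e-4, 4.0e-4]

# Characteristic impedance of each section: Z_i = rho*c / A_i, i.e. ~ 1/A_i.
impedances = [RHO * C / a for a in areas]

# Reflection coefficient seen by a plane wave passing from section i to i+1:
#   k_i = (Z_{i+1} - Z_i) / (Z_{i+1} + Z_i) = (A_i - A_{i+1}) / (A_i + A_{i+1})
reflection = [
    (areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
    for i in range(len(areas) - 1)
]

for i, (a, z) in enumerate(zip(areas, impedances)):
    print(f"section {i}: A = {a:.1e} m^2, Z = {z:.1f} kg/(m^4*s)")
for i, k in enumerate(reflection):
    print(f"junction {i}->{i + 1}: reflection coefficient k = {k:+.3f}")
```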

The theory of transmission lines is credited to applied mathematician Oliver Heaviside (1850-1925), who worked out the telegrapher’s equations (sometime after 1874) as an application of Maxwell’s equations, which he simplified (sometime after 1880) from Maxwell's original 20 equations to the modern vector formulation. Additionally, Heaviside is credited with introducing complex numbers into circuit analysis, inventing essentially Laplace-transform methods for solving circuits (sometime between 1880 and 1887), and coining the terms 'impedance' (1886), 'admittance' (1887), 'electret', 'conductance' (1885), and 'permeability' (1885). A little later, Lord Rayleigh worked out the theory of waveguides (1897), including multiple propagating modes and the cut-off phenomenon.
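For reference, the telegrapher's equations mentioned above relate voltage v(x,t) and current i(x,t) along a line with series resistance R and inductance L, and shunt conductance G and capacitance C, all per unit length; this is the standard textbook form, not quoted from the source above.

\[
\frac{\partial v}{\partial x} = -L\,\frac{\partial i}{\partial t} - R\,i ,
\qquad
\frac{\partial i}{\partial x} = -C\,\frac{\partial v}{\partial t} - G\,v .
\]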

………

Formant Synthesis Models

A formant synthesizer is a source-filter model in which the source models the glottal pulse train and the filter models the formant resonances of the vocal tract. Constrained linear prediction can be used to estimate the parameters of formant synthesis models, but more generally, formant peak parameters may be estimated directly from the short-time spectrum (e.g., [257]). The filter in a formant synthesizer is typically implemented using cascade or parallel second-order filter sections, one per formant. Most modern rule-based text-to-speech systems descended from software based on this type of synthesis model [257,258,259].
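As a rough sketch of the source-filter idea, and not a reconstruction of any published synthesizer, the following drives a cascade of second-order resonators, one per formant, with a simple impulse-train "glottal" source; the formant frequencies and bandwidths are invented values.

```python
import numpy as np

def resonator(x, f_center, bandwidth, fs):
    """Second-order (two-pole) formant resonator:
       y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    with the pole radius set by the desired bandwidth."""
    r = np.exp(-np.pi * bandwidth / fs)              # pole radius
    B = 2.0 * r * np.cos(2.0 * np.pi * f_center / fs)
    C = -r * r
    A = 1.0 - B - C                                  # unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = A * x[n] + B * y1 + C * y2
    return y

fs = 16000                       # sample rate, Hz
f0 = 110.0                       # fundamental frequency, Hz
n = int(0.5 * fs)                # half a second of output

# Crude "glottal" source: an impulse train at the fundamental frequency.
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Illustrative formant (frequency, bandwidth) pairs in Hz (made up).
formants = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

signal = source
for f_c, bw in formants:         # cascade of second-order sections
    signal = resonator(signal, f_c, bw, fs)

signal /= np.max(np.abs(signal)) # normalize before listening or plotting
```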

Another type of formant-synthesis method, developed specifically for singing-voice synthesis, is called the FOF method [389]. It can be considered an extension of the VOSIM voice synthesis algorithm [220]. In the FOF method, the formant filters are implemented in the time domain as parallel second-order sections; thus, the vocal-tract impulse response is modeled as a sum of three or so exponentially decaying sinusoids. Instead of driving this filter with a glottal pulse wave, a simple impulse is used, thereby greatly reducing computational cost. A convolution of an impulse response with an impulse train is simply a periodic superposition of the impulse response. In the VOSIM algorithm, the impulse response was trimmed to one period in length, thereby avoiding overlap and further reducing computations.

The FOF method also tapers the beginning of the impulse-response using a rising half-cycle of a sinusoid. This qualitatively reduces the “buzziness” of the sound, and compensates for having replaced the glottal pulse with an impulse. In practice, however, the synthetic signal is matched to the desired signal in the frequency domain, and the details of the onset taper are adjusted to optimize audio quality more generally, including to broaden the formant resonances.
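The following is a hedged sketch of the FOF-style construction just described, not Rodet's exact algorithm: each excitation impulse launches a grain made of exponentially decaying sinusoids at the formant frequencies, with the onset tapered by a rising sinusoidal half-cycle (a raised-cosine attack); all parameter values are invented for illustration.

```python
import numpy as np

def fof_grain(f_formant, bandwidth, attack, dur, fs):
    """One FOF-style grain: an exponentially decaying sinusoid at the
    formant frequency, with its onset tapered by a rising half-cycle
    of a (co)sinusoid. Parameters are illustrative, not published data."""
    t = np.arange(int(dur * fs)) / fs
    alpha = np.pi * bandwidth                     # decay rate sets the bandwidth
    grain = np.exp(-alpha * t) * np.sin(2.0 * np.pi * f_formant * t)
    env = np.ones_like(t)                         # onset taper, then unity
    n_att = int(attack * fs)
    env[:n_att] = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_att) / n_att))
    return grain * env

fs = 16000
f0 = 220.0                                        # pitch of the synthetic voice
period = int(fs / f0)
n_total = int(0.5 * fs)

# Three illustrative formants (frequency Hz, bandwidth Hz), made up.
formants = [(650.0, 80.0), (1100.0, 90.0), (2650.0, 120.0)]

out = np.zeros(n_total)
for start in range(0, n_total, period):           # one impulse per pitch period
    for f_c, bw in formants:                      # sum of decaying sinusoids
        g = fof_grain(f_c, bw, attack=0.003, dur=0.02, fs=fs)
        end = min(start + len(g), n_total)
        out[start:end] += g[: end - start]        # overlapping periodic superposition

out /= np.max(np.abs(out))
```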

One of the difficulties of formant synthesis methods is that formant parameter estimation is not always easy [411]. The problem is particularly difficult when the fundamental frequency F0 is so high that the formants are not adequately “sampled” by the harmonic frequencies, such as in high-pitched female voice samples. Formant ambiguities due to insufficient spectral sampling can often be resolved by incorporating additional physical constraints to the extent they are known.

Formant synthesis is an effective combination of physical and spectral modeling approaches. It is a physical model in that there is an explicit division between glottal-flow wave generation and the formant-resonance filter, despite the fact that a physical model is rarely used for either the glottal waveform or the formant resonator. On the other hand, it is a spectral modeling method in that its parameters are estimated by explicitly matching short-time audio spectra of desired sounds. It is usually most effective for any synthesis model, physical or otherwise, to be optimized in the “audio perception” domain to the extent it is known how to do this [315,166]. For an illustrative example, see, e.g., [202].

 

And now, by way of 'speech synthesis'

Speech synthesis

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output.[2]
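As a toy illustration of the concatenative approach, and not any real engine's unit database, the sketch below joins stored "units" (synthetic placeholders standing in for recorded diphones) with a short crossfade at each join.

```python
import numpy as np

fs = 16000

def fake_unit(freq, dur=0.15):
    """Placeholder for one stored unit (e.g., a diphone).
    A real system would store excised, labeled recordings of speech."""
    t = np.arange(int(dur * fs)) / fs
    return 0.5 * np.sin(2.0 * np.pi * freq * t)

# Toy "unit database": names and waveforms are invented for illustration.
units = {
    "h-e": fake_unit(220.0),
    "e-l": fake_unit(260.0),
    "l-o": fake_unit(300.0),
}

def concatenate(unit_names, xfade=0.01):
    """Join stored units in sequence, crossfading over `xfade` seconds
    to soften the joins between consecutive units."""
    n_x = int(xfade * fs)
    out = units[unit_names[0]].copy()
    for name in unit_names[1:]:
        nxt = units[name]
        fade_out = np.linspace(1.0, 0.0, n_x)
        fade_in = np.linspace(0.0, 1.0, n_x)
        out[-n_x:] = out[-n_x:] * fade_out + nxt[:n_x] * fade_in
        out = np.concatenate([out, nxt[n_x:]])
    return out

speech = concatenate(["h-e", "e-l", "l-o"])
```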

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written words on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s.

Overview of a typical TTS system

A text-to-speech system (or “engine”) is composed of two parts:[3] a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations),[4] which is then imposed on the output speech.
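To make the front-end stages concrete, here is a deliberately toy sketch of text normalization followed by grapheme-to-phoneme lookup; the lookup tables and the phoneme notation are invented for illustration and do not come from any real TTS engine.

```python
# Toy lookup tables -- invented for illustration only.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}
NUMBER_WORDS = {"2": "two", "3": "three", "10": "ten"}
PHONEMES = {            # toy grapheme-to-phoneme dictionary
    "doctor": "D AA K T ER",
    "lives": "L IH V Z",
    "at": "AE T",
    "ten": "T EH N",
    "main": "M EY N",
    "street": "S T R IY T",
}

def normalize(text):
    """Front-end step 1: expand abbreviations and numbers into written-out words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    words = [NUMBER_WORDS.get(w, w) for w in text.split()]
    return [w.lower().strip(",.;") for w in words]

def to_phonemes(words):
    """Front-end step 2: grapheme-to-phoneme conversion via dictionary lookup.
    Real systems combine large lexicons with letter-to-sound rules."""
    return [PHONEMES.get(w, "<unk>") for w in words]

words = normalize("Dr. Smith lives at 10 Main St.")
symbolic = to_phonemes(words)
print(words)      # ['doctor', 'smith', 'lives', 'at', 'ten', 'main', 'street']
print(symbolic)   # phoneme strings; unknown words map to '<unk>'
```

The back-end, not sketched here, would take this symbolic representation, impose a target prosody, and render it as sound.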

 

let us carry the story onward.