The Voice Source in Speech Production: from Models to Applications
- Author(s): Chen, Gang
- Advisor(s): ALWAN, ABEER
- et al.
The voice source carries important lexical and non-lexical information. The non-lexical information can convey, for example, prosodic events, emotional state, and cues to the uniqueness of a speaker's voice. A better understanding of the voice source, and eventually a better model of it, would benefit speech applications such as speech recognition, speech synthesis, speaker identification, and age/gender classification, as well as clinical assessment.
This dissertation has three main goals. The first is to better understand the voice source by analyzing images of the vocal folds from laryngeal high-speed videoendoscopy (HSV) recordings. A new automatic method is proposed to compactly summarize, from HSV data, the overall spatial synchronization pattern of vocal fold vibration across the entire laryngeal area. Additionally, a new measure is proposed to capture perceptually important variations in the shapes of glottal area pulses extracted from HSV data.
The second goal is to study the acoustic consequence of a physiological vocal-fold vibration pattern (the glottal gap effect) and to apply the findings to gender classification of children's voices. Voice source related measures are found to improve classification accuracy, especially for younger speakers (10 to 15 years old).
The third goal is to propose new voice source models and evaluate them in different applications. In the first application, a new source model and a noise-robust automatic source estimation algorithm are proposed to estimate the voice source from speech signals. Results in both clean and noisy conditions show that the proposed model and algorithm estimate the voice source signal accurately and robustly. In the second application, the proposed source model is used for vowel synthesis. Perceptual listening experiments show that the proposed model provides a better perceptual match to the target voice than traditional models do.