Speech Normalization and Data Augmentation Techniques Based on Acoustical and Physiological Constraints and Their Applications to Child Speech Recognition

Abstract

Recently, adult automatic speech recognition (ASR) system performance has improved dramatically. In contrast, the performance of child ASR systems remains inadequate in an era when demand for child speech technology is rising. While adult speech data is abundant, publicly available child speech data is sparse due, in part, to privacy concerns. Hence, many child ASR systems are trained on adult speech data. However, such systems perform poorly on child speech because of the acoustic mismatch arising from differences in body size, especially in the vocal folds and the vocal tract, as well as the high variability of child speech.

This research analyzes the acoustical properties of child speech across various ages and compares them to those of adult speech. Specifically, the subglottal resonances (SGRs), fundamental frequency (fo), and formant frequencies of vowel productions are investigated. These acoustic features are shown to be capable of predicting acoustic structure across speakers. Accordingly, we propose feature extraction methods that use these properties to normalize the acoustic structure across speakers and reduce the acoustic mismatch between adult and child speech. This allows child ASR systems to leverage adult data for training and suggests a framework for a universal ASR system that need not be adult- or child-dependent. Furthermore, we demonstrate that when child speech data is limited, these feature normalization methods produce significant improvements in child ASR for both Gaussian mixture model (GMM)- and deep neural network (DNN)-based systems.
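
The abstract does not spell out the normalization procedure itself, but one common way to realize formant-based speaker normalization is a vocal tract length normalization (VTLN)-style warping of the mel filterbank. The sketch below is a minimal illustration under that assumption only; the reference F3 value, the scale-factor estimate, and all function names are hypothetical and are not taken from the dissertation.

    import numpy as np

    def hz_to_mel(f_hz):
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def formant_scale_factor(speaker_f3_hz, reference_f3_hz=2500.0):
        # Hypothetical estimate: ratio of a speaker's mean F3 to an adult
        # reference F3. Child speakers typically yield alpha > 1.
        return speaker_f3_hz / reference_f3_hz

    def warped_mel_filterbank(n_filters, n_fft, sample_rate, alpha=1.0,
                              f_min=0.0, f_max=None):
        # Triangular mel filterbank whose center frequencies are multiplied
        # by alpha, so filter i reads the speaker's spectrum near alpha * f_i
        # and maps that energy back to the reference frequency f_i (an
        # effective warp of f -> f / alpha).
        if f_max is None:
            f_max = sample_rate / 2.0
        f_max = min(f_max, (sample_rate / 2.0) / alpha)  # keep warped filters below Nyquist
        mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
        hz_points = mel_to_hz(mel_points) * alpha        # speaker-specific warp
        bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fbank[i - 1, k] = (k - left) / (center - left)
            for k in range(center, right):
                fbank[i - 1, k] = (right - k) / (right - center)
        return fbank

    # Example: a child whose mean F3 is near 3400 Hz gets alpha of about 1.36,
    # so the filterbank is stretched toward higher frequencies before the
    # log-energy and DCT steps, making the resulting features more adult-like.
    alpha = formant_scale_factor(speaker_f3_hz=3400.0)
    fbank = warped_mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000, alpha=alpha)
    print(round(alpha, 2), fbank.shape)

The same filterbank construction works for adult speakers with alpha near 1, which is what allows a single, speaker-normalized front end to be shared across adult and child data.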
