The potential of speech technology to improve educational outcomes has been a topic of great interest in recent years. For example, automatic speech recognition (ASR) systems could be employed to provide kindergarten-aged children with real-time feedback on their literacy and pronunciation as they practice reading aloud. Within these systems, speaker identification (SID) technology could additionally be used to identify a user's speaker characteristics and ensure that they receive age-, language-, and dialect-appropriate feedback. While these technologies are more established for well-represented groups in STEM (i.e., able-bodied, adult, first-language speakers of mainstream dialects), they perform considerably worse for underrepresented groups (young children, speakers of non-mainstream dialects, people with speech-related disabilities, etc.). This work focuses on improving speech technology performance for children's speech and African American English (AAE) dialect speech with the goal of creating more equitable outcomes in early education. The contributions of this work span three primary areas: 1) dialect identification and density scoring, 2) data augmentation for speech recognition, and 3) natural language processing for fair and inclusive automatic speech assessment.
First, we create a robust system for dialect identification of African American English for both children's and adults' speech. This system takes an input utterance from a speaker of either African American English or Mainstream American English and determines to which of the two dialects the utterance belongs. The system fuses features from paralinguistics, self-supervised learning representations, automatic speech recognition system outputs, prosodic contours, and other descriptors of the speech signal to learn a mapping from the input acoustic information to a dialect classification decision. We further explore this architecture for automatic dialect density estimation, a task we introduce and develop. In dialect density scoring, we train a system to automatically predict a speaker's frequency of usage of dialect-specific patterns. This information can then be passed to a speech recognition system for more dialect-informed processing.
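As a concrete illustration, the sketch below shows one way such a fusion-based classifier could be assembled, assuming 16 kHz mono input. The self-supervised and paralinguistic features used in our system are substituted with simple stand-ins (MFCC statistics and an F0-contour summary) so the example stays self-contained; the feature extractors, classifier choice, and the hypothetical `load_labeled_corpus` loader are illustrative assumptions, not the exact configuration of our system.

```python
# Minimal sketch of a fusion-based AAE/MAE utterance classifier.
# MFCC statistics stand in for richer acoustic/self-supervised embeddings,
# and an F0-contour summary stands in for the full prosodic feature set.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

SR = 16000  # assumed sample rate

def acoustic_descriptor(y: np.ndarray) -> np.ndarray:
    """Utterance-level acoustic summary: mean and std of 13 MFCCs."""
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def prosody_descriptor(y: np.ndarray) -> np.ndarray:
    """Summary of the F0 contour over voiced frames."""
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=SR)
    voiced = f0[np.isfinite(f0)]
    if voiced.size == 0:
        return np.zeros(3)
    return np.array([voiced.mean(), voiced.std(), np.ptp(voiced)])

def fused_features(y: np.ndarray) -> np.ndarray:
    # Early fusion: concatenate the per-utterance descriptors into one vector.
    return np.concatenate([acoustic_descriptor(y), prosody_descriptor(y)])

# Training (labels: 1 = AAE, 0 = MAE). `load_labeled_corpus` is hypothetical.
# utterances, labels = load_labeled_corpus()
# X = np.stack([fused_features(u) for u in utterances])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
# For dialect density scoring, the same fused features can instead feed a
# regressor that predicts the rate of dialect-specific pattern usage.
```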
Second, we develop a data augmentation algorithm to improve zero-shot and few-shot speech recognition of low-resource dialects. The algorithm, named LPCAugment, deconstructs an input speech signal into a source and filter representation using linear predictive coding (LPC) analysis. The poles of the filter representation can then be perturbed independently of the source representation to model the formant shifts observed across accents and dialects. We use this perturbation method to artificially generate speech samples with shifted formant locations that serve as additional training data for a speech recognition system. The resulting system is then evaluated on speech from child speakers of a Southern California dialect and from child speakers of an Atlanta, Georgia, area dialect.
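The core perturbation step can be sketched as follows, assuming a short frame of 16 kHz speech. A full implementation would operate frame by frame with tuned perturbation ranges; the LPC order and the angle-scaling range `alpha_range` below are illustrative assumptions rather than the settings used in this work.

```python
# Sketch of LPC-based formant perturbation: decompose a speech frame into
# source (residual) and filter (all-pole) components, shift the filter's
# pole angles to move formant frequencies, then resynthesize.
import numpy as np
import librosa
from scipy.signal import lfilter

def perturb_formants(frame: np.ndarray, order: int = 16,
                     alpha_range: tuple = (0.9, 1.1)) -> np.ndarray:
    # 1) Source-filter decomposition: a = [1, a1, ..., ap] is the inverse
    #    filter A(z), so filtering the frame with A(z) yields the residual
    #    (an approximation of the glottal source).
    a = librosa.lpc(frame, order=order)
    residual = lfilter(a, [1.0], frame)

    # 2) Perturb the complex pole pairs: scaling a pole's angle shifts the
    #    formant frequency it models, while its radius (bandwidth) and the
    #    conjugate symmetry of the pair are preserved, keeping the filter
    #    stable and its coefficients real.
    poles = np.roots(a)
    alpha = np.random.uniform(*alpha_range)
    cplx = np.abs(poles.imag) > 1e-8  # leave real poles untouched
    poles[cplx] = np.abs(poles[cplx]) * np.exp(1j * np.angle(poles[cplx]) * alpha)

    # 3) Resynthesize: drive the perturbed all-pole filter with the
    #    untouched source.
    a_new = np.real(np.poly(poles))
    return lfilter([1.0], a_new, residual)

# Usage: create a formant-shifted augmentation of a training utterance.
# y, _ = librosa.load("utterance.wav", sr=16000)
# y_aug = perturb_formants(y)
```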
Third, we explore automatic analysis and scoring of speech recognition transcripts for educational assessments. Given information about a student's spoken dialect and automatically generated transcripts of their oral response to an assessment prompt, we train a system to automatically grade the quality of the response with respect to a predetermined criterion. This system uses language modeling and spoken information retrieval to identify key features in the spoken response and holistically decide whether the response aligns with the grading criteria. Combined, the steps in this work form a framework for inclusive spoken language understanding technology that can be used to provide students with dialect-appropriate language training or language assessment.
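As one example of how such a grader's retrieval component might look, the sketch below scores a transcript against a rubric criterion using a pretrained sentence-embedding model. The model name, the example criterion and transcript, and the use of cosine similarity as the alignment score are all illustrative assumptions, not the exact method developed in this work.

```python
# Retrieval-style sketch: score an ASR transcript against a grading criterion
# by semantic similarity between sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def score_response(transcript: str, criterion: str) -> float:
    """Cosine similarity between the transcript and the rubric criterion."""
    emb = model.encode([transcript, criterion], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Hypothetical example: grading an oral reading-comprehension response.
criterion = "The student identifies the story's main character and setting."
transcript = "the story was about a girl named maya who lives on a farm"
print(score_response(transcript, criterion))  # higher = better aligned
```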