Automatic speech recognition (ASR) systems have improved significantly over the last decade due to advances in deep learning algorithms and easier access to very large datasets. ASR systems, however, face two major challenges. The first is accuracy degradation in low-resource conditions, such as child speech; the second is low inference efficiency caused by the autoregressive decoding mechanism and the large size of ASR models. In this dissertation, we address these challenges by introducing novel techniques that improve accuracy and inference efficiency for ASR tasks, especially child ASR.
To address the accuracy challenge, we introduce novel self-supervised learning (SSL) methods that use unannotated adult speech data and explore how these methods can improve downstream child ASR tasks. Specifically, a bidirectional autoregressive predictive coding (Bi-APC) method is proposed for pretraining non-causal models on adult speech data. The pretrained model is then finetuned on supervised child speech-text pairs. We also propose a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shift in pretrained speech models. DRAFT is effective both for APC, which uses a causal transformer as the backbone, and for the Bi-APC, Wav2vec2.0, and HuBERT methods, which use a non-causal transformer as the backbone.
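As background for Bi-APC, the sketch below shows the standard left-to-right APC objective that it extends: a causal model predicts the acoustic frame n steps ahead, with an L1 loss on unannotated speech. The LSTM backbone, feature dimension, and prediction step here are illustrative assumptions, not the dissertation's exact configuration; Bi-APC itself pretrains a non-causal transformer by combining left-to-right and right-to-left prediction losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAPC(nn.Module):
    """Left-to-right APC: at each time step, predict the frame n steps ahead."""
    def __init__(self, feat_dim=80, hidden_dim=512, shift=3):
        super().__init__()
        self.shift = shift  # prediction step n (hypothetical value)
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=3,
                               batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):              # feats: (batch, T, feat_dim)
        hidden, _ = self.encoder(feats)    # causal hidden states
        pred = self.head(hidden)           # frame-level predictions
        # The state at time t is trained to predict the input frame at t + n.
        return F.l1_loss(pred[:, :-self.shift], feats[:, self.shift:])

# One pretraining step on unannotated (adult) speech features, e.g. log-mels.
model = CausalAPC()
loss = model(torch.randn(4, 200, 80))
loss.backward()
```

A bidirectional variant would add a mirrored right-to-left predictor and sum the two losses, which is what allows a non-causal backbone to be pretrained before finetuning on child speech.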
To address the inference efficiency challenge, we introduce a novel Connectionist Temporal Classification (CTC) Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR, and perform a comprehensive evaluation of it in this dissertation. In CASS-NAT, the word embeddings in an autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAEs), which are extracted from the encoder outputs using CTC alignments. TAEs can be obtained in parallel without recurrent operations, enabling parallel generation of the output sequence. In addition, an error-based alignment sampling method is proposed to reduce the alignment mismatch between training and inference. CASS-NAT achieves an approximately 20x speedup during inference without significant performance degradation compared to the AT baseline. We also propose a CASS-NAT variant, UniEnc-CASSNAT, that consists of only an encoder module. Together with the proposed multi-pass CTC training and iterative decoding, UniEnc-CASSNAT performs as well as CASS-NAT with fewer model parameters.
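To make the TAE idea concrete, the sketch below segments a frame-level CTC alignment into runs of non-blank labels and pools the encoder outputs within each run, yielding one acoustic embedding per output token. The mean pooling, blank id, and toy shapes are assumptions for illustration only; the dissertation's extractor is learned and may differ from this simple averaging.

```python
import torch

def extract_tae(enc_out, alignment, blank_id=0):
    """enc_out: (T, D) encoder outputs; alignment: length-T frame-level token
    ids from a CTC alignment. Returns one embedding per non-blank token."""
    embeddings = []
    t, T = 0, len(alignment)
    while t < T:
        tok, start = alignment[t], t
        while t < T and alignment[t] == tok:   # run of identical labels
            t += 1
        if tok != blank_id:                    # skip CTC blank frames
            embeddings.append(enc_out[start:t].mean(dim=0))
    return torch.stack(embeddings)             # (num_tokens, D)

enc_out = torch.randn(10, 4)                   # toy encoder outputs
alignment = [0, 5, 5, 0, 0, 7, 7, 7, 0, 9]     # blanks (0) between tokens
tae = extract_tae(enc_out, alignment)          # -> shape (3, 4)
```

Because every token's segment is known once the alignment is fixed, all embeddings can be computed at once, which is what removes the recurrent dependency of autoregressive decoding.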
Beyond these two challenges, and to facilitate the development of better child ASR, we build the first child ASR benchmark for the research community. The benchmark compares widely used ASR techniques, including data augmentation, parameter-efficient finetuning (PEFT), self-supervised models (HuBERT and WavLM), and supervised models (such as Whisper). All code developed in this dissertation will be made publicly available.