This thesis aims to improve automatic speech recognition (ASR) for children. The most challenging problem is the scarcity of transcribed children's speech databases, which makes children's ASR a low-resource task. We approach this problem from three aspects. First, compared with adults' speech, children's speech exhibits larger intra- and inter-speaker variability, owing to the differing growth patterns of children's vocal tracts, which exacerbates the low-resource problem. Second, a considerable amount of children's speech on the Internet is untranscribed; exploring how to exploit such untranscribed data to improve children's ASR is therefore significant but challenging. Last, with little training data, random model initialization leads to inadequate training, so finding a good model initialization is important for building a robust children's ASR model in a low-resource setting.
We improve the performance of children's ASR systems along these three aspects. First, we compare several effective data augmentation methods for children's ASR. On the OGI Kids' Corpus, we achieve a WER reduction of around 10 % for a hybrid HMM-BLSTM ASR system and around 25 % for an end-to-end ASR system that combines Connectionist Temporal Classification with an attention-based encoder-decoder (CTC-AED). Second, we use unsupervised pre-training and semi-supervised learning as two effective methods for exploiting untranscribed data: one of the unsupervised pre-training methods, bidirectional autoregressive predictive coding, combined with three iterations of semi-supervised learning, brings a WER reduction of 9.6 % with 60 hours of untranscribed data. Third, we use model-agnostic meta-learning (MAML)-based meta-initialization (MI) to find a good model initialization. However, MI is vulnerable to overfitting on the training tasks (learner overfitting). To alleviate learner overfitting, we propose an age-based task-level augmentation method. Combining task-level augmentation with MI, the children's ASR system achieves a WER reduction of 51 % on kindergarten-aged speech over a baseline with neither augmentation nor meta-initialization.
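To make the meta-initialization idea concrete, the following is a minimal, hypothetical sketch of first-order MAML-style meta-initialization on toy 1-D regression tasks, not the thesis's actual acoustic-model setup. Each synthetic "task" has its own slope, loosely mirroring age-based task construction; all function names (`make_task`, `meta_init`) and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(slope):
    """One synthetic task: y = slope * x plus small noise.

    In the thesis setting a task would be an age group of child
    speakers; here it is just a 1-D regression problem (assumption).
    """
    x = rng.uniform(-1, 1, size=20)
    y = slope * x + 0.01 * rng.normal(size=20)
    return x, y

def grad(w, x, y):
    # Gradient of the mean squared error 0.5 * (w*x - y)^2 w.r.t. w.
    return np.mean((w * x - y) * x)

def meta_init(tasks, meta_steps=200, inner_lr=0.1, outer_lr=0.05):
    """Learn a shared initialization w across tasks (first-order MAML)."""
    w = 0.0  # the meta-initialization being learned
    for _ in range(meta_steps):
        meta_grad = 0.0
        for x, y in tasks:
            # Inner loop: one gradient step of task-specific adaptation.
            w_adapt = w - inner_lr * grad(w, x, y)
            # Outer loop (first-order approximation): gradient of the
            # post-adaptation loss, evaluated at the adapted weights.
            meta_grad += grad(w_adapt, x, y)
        w -= outer_lr * meta_grad / len(tasks)
    return w

# "Age-based" tasks with slopes clustered around 2.0 (illustrative):
tasks = [make_task(s) for s in (1.8, 2.0, 2.2)]
w0 = meta_init(tasks)
# w0 lands near the shared structure of the tasks, so a new task
# (e.g. a new age group) adapts in very few gradient steps from w0.
```

The point of the sketch is the two-level structure: the inner loop adapts to one task, while the outer loop updates the shared initialization so that such adaptation works well on average across tasks, which is what makes the initialization useful for unseen (e.g. kindergarten-aged) conditions.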