Combining Speech and Speaker Recognition - A Joint Modeling Approach
- Author(s): Su, Hang
- Advisor(s): Morgan, Nelson
- et al.
Automatic speech recognition (ASR) and speaker recognition (SRE) are two important fields of research in speech technology. Over the years, considerable effort has gone into improving recognition accuracy on both tasks, and many different technologies have been developed. Given the close relationship between the two tasks, researchers have proposed various ways to apply techniques developed for one task to the other.
In the first half of this thesis, I explore ways to improve speaker recognition performance using state-of-the-art speech recognition acoustic models, and then investigate alternative ways to perform speaker adaptation of deep learning models for ASR using speaker identity vectors (i-vectors). Experiments from this work show that ASR and SRE are beneficial to each other and that each can be used to improve the other's performance.
In the second part of the thesis, I aim to build a joint model for speech and speaker recognition. To implement this idea, I first build an open-source experimental framework, TIK, that connects the well-known deep learning toolkit TensorFlow and the speech recognition toolkit Kaldi. After reproducing state-of-the-art speech and speaker recognition performance using TIK, I then develop a unified model, JointDNN, that is trained jointly for speech and speaker recognition. Experimental results show that the joint model can effectively perform both ASR and SRE tasks. In particular, experiments show that the JointDNN model is more effective in speaker recognition than an x-vector system when only a limited amount of training data is available.