Robust Automatic Recognition of Birdsongs and Human Speech: a Template-Based Approach
This dissertation focuses on robust signal processing algorithms for birdsongs and speech signals. Automatic phrase or syllable detection systems for bird sounds are useful in several applications. However, bird-phrase detection is challenging due to segmentation errors, duration variability, limited training data, and background noise. Two spectrograms with identical class labels may look different because of time misalignment and frequency variation. In real recording environments such as a forest, the data can be corrupted by background interference such as rain, wind, other animals, or even other birds vocalizing. A noise-robust classifier needs to handle such conditions. Similarly, Automatic Speech Recognition (ASR) works well in quiet environments, but performance degrades substantially when the speech signal is corrupted by background noise. ASR performance would benefit from robust representations of speech signals and from robust recognition systems.
The first topic of this dissertation focuses on an automatic birdsong-phrase recognition system that is robust to limited training data, class variability, and noise. The algorithm comprises a noise-robust Dynamic-Time-Warping (DTW)-based segmentation and a discriminative classifier for outlier rejection. The algorithm utilizes DTW and prominent (high-energy) time-frequency regions of training spectrograms to derive a reliable noise-robust template for each phrase class. The resulting template is then used to segment continuous recordings into segment candidates, whose spectrogram amplitudes in the prominent regions are used as features for a Support Vector Machine (SVM). In addition, we present a novel approach to training Hidden Markov Models (HMMs) with extremely limited data. First, the algorithm learns global Gaussian Mixture Models (GMMs) from all available training phrases. The GMM parameters are then used to initialize the state parameters of each individual model. The number of states and the number of mixture components for each state are determined by the acoustic variation of each phrase type. The prominent (high-energy) time-frequency regions are used to compute the state emission probabilities to increase noise robustness.
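As an illustration of the alignment step that a DTW-based template relies on, the following is a minimal sketch of the textbook DTW recurrence between two sequences of spectrogram frames. It is not the noise-robust variant developed in the dissertation: the restriction to prominent time-frequency regions is omitted, and the function names `frame_distance` and `dtw`, the Euclidean frame distance, and the list-of-lists frame layout are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two frames (equal-length lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(template, segment):
    """Align two frame sequences with classic DTW.

    Returns (total_cost, path), where path is a list of
    (template_index, segment_index) pairs from start to end.
    """
    n, m = len(template), len(segment)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i template frames
    # with the first j segment frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], segment[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # both sequences advance
                cost[i - 1][j],      # template advances only
                cost[i][j - 1],      # segment advances only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy one-bin "spectrograms": the segment repeats its first frame.
template = [[0.0], [1.0], [2.0]]
segment = [[0.0], [0.0], [1.0], [2.0]]
total_cost, path = dtw(template, segment)
```

In this toy case the repeated first frame of the segment is absorbed by the warping path, so the two sequences align with zero cost even though their durations differ. This is the kind of duration variability the template-based approach has to tolerate.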
The second topic of the dissertation deals with noise-robust processing for automatic speech recognition. We propose a new pitch-based spectral enhancement algorithm, operating on voiced frames, for speech analysis and noise-robust speech processing. The proposed algorithm simultaneously determines a time-warping function (TWF) and the speaker's pitch with high precision. This technique reduces the smearing between harmonics that occurs when the fundamental frequency is not constant within the analysis window. To do so, we propose a metric called the harmonic residual, which measures the difference between the actual spectrum and the resynthesized spectrum derived from the linear model of speech production, with various combinations of TWF and high-precision pitch values as parameters. The TWF and pitch pair that yields the minimum harmonic residual is selected, and the enhanced spectrum is obtained accordingly. We show how this new representation can also be used for automatic speech recognition by proposing a robust spectral representation derived from harmonic amplitude interpolation.
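The joint search over pitch and warp can be pictured with a heavily simplified sketch. The assumptions here are ours and not the dissertation's definitions: the residual is computed in the time domain rather than between spectra, the TWF is reduced to a single linear pitch-change rate (the hypothetical `chirp` parameter, in Hz per second), harmonic amplitudes are estimated by simple correlation, and the names `harmonic_residual` and `best_pitch_and_warp` are illustrative only.

```python
import math


def harmonic_residual(frame, fs, f0, chirp, n_harmonics=3):
    """Energy left after subtracting a harmonic resynthesis of `frame`.

    Assumption: pitch varies linearly in the frame, f(t) = f0 + chirp * t.
    """
    n = len(frame)
    # Phase of the fundamental under the assumed linear pitch track.
    phase = [
        2 * math.pi * (f0 * (i / fs) + 0.5 * chirp * (i / fs) ** 2)
        for i in range(n)
    ]
    resynth = [0.0] * n
    for k in range(1, n_harmonics + 1):
        cos_k = [math.cos(k * p) for p in phase]
        sin_k = [math.sin(k * p) for p in phase]
        # Amplitudes by correlation (treats the basis as roughly orthogonal).
        a = 2.0 / n * sum(x * c for x, c in zip(frame, cos_k))
        b = 2.0 / n * sum(x * s for x, s in zip(frame, sin_k))
        for i in range(n):
            resynth[i] += a * cos_k[i] + b * sin_k[i]
    return sum((x - y) ** 2 for x, y in zip(frame, resynth))


def best_pitch_and_warp(frame, fs, f0_candidates, chirp_candidates):
    """Grid search for the (f0, chirp) pair with the smallest residual."""
    return min(
        ((f0, c) for f0 in f0_candidates for c in chirp_candidates),
        key=lambda pair: harmonic_residual(frame, fs, pair[0], pair[1]),
    )


# Synthetic voiced frame: constant 200 Hz pitch with three harmonics.
fs = 8000
frame = [
    math.sin(2 * math.pi * 200 * i / fs)
    + 0.5 * math.sin(2 * math.pi * 400 * i / fs)
    + 0.25 * math.cos(2 * math.pi * 600 * i / fs)
    for i in range(400)
]
best = best_pitch_and_warp(
    frame, fs, [180.0, 200.0, 220.0], [-500.0, 0.0, 500.0]
)
```

For this synthetic frame the search selects the true pair, a pitch of 200 Hz with zero pitch change. A realistic implementation would need a much finer candidate grid and a proper least-squares amplitude fit, since the correlation estimate is only exact when the harmonics are orthogonal over the frame.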