Robust Speech and Bird Song Processing using Multi-band Correlograms and Sparse Representations
- Author(s): Tan, Lee Ngee
- Advisor(s): Alwan, Abeer
- et al.
This dissertation focuses on algorithms for robust speech and bird song processing. Many applications perform well under ideal signal conditions, e.g., noise-free, full-bandwidth input and sufficient training data. However, performance generally degrades sharply when the input signal deviates from these ideal conditions. This dissertation describes robust algorithms for three applications, namely human pitch detection, automatic speech recognition, and birdsong phrase classification.
In the first application, a noise-robust, multi-band summary correlogram (MBSC)-based pitch detector is proposed. Novel signal processing schemes, which include comb-filter channel selection and subband reliability weighting, are designed to enhance the MBSC's peak at the most likely pitch period.
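The idea of a weighted multi-band summary correlogram can be illustrated with a minimal sketch. It is not the dissertation's MBSC implementation: the FFT-based band splitting, the rectified-signal envelope, the peak-height reliability weight, the band edges, and all function names are assumptions made for illustration, and comb-filter channel selection is omitted entirely.

```python
import numpy as np

def band_limit(x, fs, lo, hi):
    """Keep only the frequency content in [lo, hi) Hz using an FFT mask."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def norm_autocorr(x, max_lag):
    """Biased autocorrelation for lags 0..max_lag, normalized by the lag-0 energy."""
    x = x - x.mean()
    energy = np.dot(x, x)
    if energy == 0.0:
        return np.zeros(max_lag + 1)
    ac = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])
    return ac / energy

def summary_correlogram_pitch(x, fs, bands, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) from a reliability-weighted sum of subband correlograms."""
    min_lag, max_lag = int(fs / fmax), int(fs / fmin)
    summary = np.zeros(max_lag + 1)
    for lo, hi in bands:
        envelope = np.abs(band_limit(x, fs, lo, hi))  # crude envelope (assumption)
        ac = norm_autocorr(envelope, max_lag)
        # Assumed reliability weight: height of the strongest in-range peak.
        weight = max(ac[min_lag:].max(), 0.0)
        summary += weight * ac
    best_lag = min_lag + int(np.argmax(summary[min_lag:]))
    return fs / best_lag

# Toy input: a 200 Hz harmonic complex in white noise (synthetic, not real speech).
fs, f0 = 8000, 200.0
t = np.arange(400) / fs
rng = np.random.default_rng(0)
x = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, 20))
x = x + 0.5 * rng.standard_normal(t.size)
bands = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
estimate = summary_correlogram_pitch(x, fs, bands)
```

Weighting each subband before summation lets bands with clear periodicity dominate the summary peak, which is the intuition behind subband reliability weighting.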
In the second application, a feature enhancement scheme using jointly-sparse reference and estimated soft-mask representations is developed for noise-robust automatic speech recognition (ASR). Reference and estimated soft-mask exemplar-pairs are extracted from clean and noisy utterance-pairs in the training data. Using a sparsity-based dictionary learning algorithm, dictionary representations are trained from the exemplar-pairs. The sparse linear combination of estimated soft-mask dictionary representations that best approximates the test utterance's estimated soft-mask is applied to the reference soft-mask dictionary to produce an enhanced soft-mask. This enhanced soft-mask is then used to perform noise suppression on the spectrogram from which features for ASR are extracted.
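The enhancement step described above can be sketched as follows, assuming the paired dictionaries have already been trained. The dictionaries here are random placeholders, the sparse solver is plain ISTA (iterative soft-thresholding) standing in for whatever solver the dissertation uses, and `D_est`, `D_ref`, `lam`, and the function names are hypothetical.

```python
import numpy as np

def ista(D, y, lam=0.05, n_iter=300):
    """Solve min_x 0.5*||y - D x||^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - step * (D.T @ (D @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # shrinkage
    return x

def enhance_soft_mask(D_est, D_ref, est_mask, lam=0.05):
    """Code the estimated mask on D_est, then decode the same code with D_ref."""
    code = ista(D_est, est_mask.ravel(), lam)
    enhanced = D_ref @ code
    # A soft mask is a per-bin gain, so keep it inside [0, 1].
    return np.clip(enhanced, 0.0, 1.0).reshape(est_mask.shape)

# Placeholder data: random dictionaries and masks, not trained or measured values.
rng = np.random.default_rng(0)
n_bins, n_frames, n_atoms = 8, 5, 12
D_est = rng.random((n_bins * n_frames, n_atoms))   # estimated-mask atoms (columns)
D_ref = rng.random((n_bins * n_frames, n_atoms))   # paired reference-mask atoms
est_mask = rng.random((n_bins, n_frames))
noisy_spec = rng.random((n_bins, n_frames)) + 0.1  # non-negative magnitudes

enhanced_mask = enhance_soft_mask(D_est, D_ref, est_mask)
enhanced_spec = enhanced_mask * noisy_spec         # noise suppression by masking
```

Because atom k of both dictionaries comes from the same exemplar-pair, a code computed on the estimated-mask side can be reused unchanged on the reference side.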
In the third application, a simple exemplar-based sparse representation (SR) classifier is evaluated on limited data for birdsong phrase classification and verification. Song recordings of the Cassin's Vireo are used for performance evaluation. This study of the SR classifier for bird phrase classification is inspired by a paper that proposed the SR classifier for face recognition and outlier face detection, and reported good performance with only 7 training images per subject. Algorithmic enhancements are subsequently added to the original SR classification framework to improve the classification accuracy of automatically detected and segmented phrases, and of phrases sung by individual birds that are not found in the training set. These algorithmic enhancements include dynamic time warping (DTW) and frame-based feature normalization prior to SR classification. When the class decisions from DTW and first-pass SR classification differ, SR classification is repeated with frequency-bin-normalized spectrographic features to resolve the two conflicting decisions.
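A minimal sketch of exemplar-based SR classification, under stated assumptions: the training exemplars form the dictionary columns, a greedy orthogonal matching pursuit replaces the l1 minimization commonly used with SR classifiers, and the class with the smallest class-wise reconstruction residual wins. The data are synthetic, the names are hypothetical, and the DTW and normalization stages are not shown.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit: approximate y with a few columns of D."""
    residual, support = y.copy(), []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        idx = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if idx in support:
            break
        support.append(idx)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

def src_classify(exemplars, labels, y, n_nonzero=3):
    """Return the label whose exemplars best reconstruct y from the sparse code."""
    D = exemplars / np.linalg.norm(exemplars, axis=0, keepdims=True)
    y = y / np.linalg.norm(y)
    x = omp(D, y, n_nonzero)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D[:, labels == c] @ x[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Synthetic stand-in for phrase features: 3 classes, 5 exemplars each.
rng = np.random.default_rng(0)
prototypes = rng.standard_normal((20, 3))
labels = np.repeat(np.arange(3), 5)
exemplars = prototypes[:, labels] + 0.05 * rng.standard_normal((20, 15))
query = prototypes[:, 1] + 0.05 * rng.standard_normal(20)
predicted = src_classify(exemplars, labels, query)
```

The smallest residual could also be thresholded to reject queries that no class reconstructs well, in the spirit of the outlier detection and verification mentioned above.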