eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Low Complexity Spectral Imputation for Noise Robust Speech Recognition

Abstract

With the recent push of Automatic Speech Recognition (ASR) capabilities to mobile devices, the user's voice is now recorded in environments with a potentially high level of background noise. To reduce the sensitivity of ASR performance to these distortions, techniques have been proposed that preprocess the speech waveforms to remove noise effects while preserving discriminative speech information. At the expense of increased complexity, recent algorithms have significantly improved recognition accuracy but remain far from human performance in highly noisy environments.

With a concern for both complexity and performance, this thesis investigated ways to reduce the corruptive effect of noise by directly weighting the power spectrum (SMFpow) or log-spectrum (SMFlog) of speech by a mask whose values lie within [0,1] and are determined by the local relative prominence of speech and noise energy. Additional contributions include a low-complexity approach to mask estimation and the use of spectral flooring to match the dynamic range of clean and noisy spectra. These two techniques are evaluated on two standard noisy ASR databases: the Aurora-2 connected digits recognition task with 11 words, and the Aurora-4 continuous speech recognition task with 5000 words.
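As a rough illustration of the idea only, the sketch below applies a soft mask to a single frame of spectral energies. The mask formula (an estimated speech share of each bin's energy), the noise estimate, the floor value, and all function names are assumptions made for this example; the abstract does not specify how the thesis estimates the mask or sets the floor.

```python
import math

def soft_mask(noisy_power, noise_power):
    """Hypothetical mask in [0, 1]: estimated speech share of each bin's energy."""
    mask = []
    for y, n in zip(noisy_power, noise_power):
        speech = max(y - n, 0.0)  # crude speech-energy estimate (assumption)
        mask.append(speech / y if y > 0 else 0.0)
    return mask

def floor_spectrum(power, floor):
    """Spectral flooring: raise every bin to at least `floor`."""
    return [max(p, floor) for p in power]

def weight_power(power, mask):
    """Power-domain weighting, in the spirit of SMFpow."""
    return [m * p for m, p in zip(mask, power)]

def weight_log(power, mask):
    """Log-domain weighting, in the spirit of SMFlog (power must be > 0)."""
    return [m * math.log(p) for m, p in zip(mask, power)]

# One frame with three frequency bins (illustrative values, not real data).
noisy = [100.0, 4.0, 0.5]
noise = [4.0, 4.0, 4.0]   # assumed noise-energy estimate per bin

mask = soft_mask(noisy, noise)
floored = floor_spectrum(noisy, floor=1.0)  # illustrative floor value
masked_pow = weight_power(floored, mask)
masked_log = weight_log(floored, mask)
```

Bins dominated by speech keep most of their energy, while bins dominated by noise are pushed toward zero. Flooring before taking the logarithm keeps the log-spectrum finite and limits its dynamic range.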

On the Aurora-2 task, the SMFlog algorithm leads to state-of-the-art performance with limited complexity compared to existing techniques. The SMFpow technique, however, results in many insertions, which we attribute to the rather weak language model in the Aurora-2 setup. On the Aurora-4 task, both algorithms show significant improvements over the unenhanced baselines. In particular, word accuracies obtained with SMFpow approach those of a state-of-the-art front-end algorithm on half of the noise types. Yet performance depends heavily on the noise type, suggesting that the proposed technique is effective only given a good initial mask estimate.

This study confirms the potential of techniques based on direct spectrum masking and proposes a framework for applying them. Future work will need to consider more elaborate mask estimation techniques to improve performance further.
