Kwasny, Stan C.; Kalman, Barry L.; Wu, Weiland; Engebretson, A. Maynard

Identifying Language from Speech: An Example of High-Level, Statistically-Based Feature Extraction

1992

Abstract

We are studying the extraction of high-level, statistically based features from raw speech. Given carefully chosen features, we conjecture that extraction can be performed reliably and in real time. As an example of this process, we demonstrate how speech samples can be classified reliably into categories according to what language was spoken. The success of our method depends critically on the distributional patterns of speech over time. We observe that spoken communication among humans utilizes a myriad of devices to convey messages, including frequency, pitch, sequencing, etc., as well as prosodic and durational properties of the signal. The complexity of the interactions among these devices is difficult to capture in any simplistic model, which has necessitated the use of models capable of addressing this complexity, such as hidden Markov models and neural networks. We have chosen to use neural networks for this study. A neural network is trained on speech samples collected from fluent, bilingual speakers in an anechoic chamber. These samples are classified according to what language is being spoken and randomly grouped into training and testing sets. Training is conducted over a fixed, short interval (segment) of speech, while testing involves applying the network multiple times to segments within a larger, variable-size window. A plurality vote over the segment-level outputs determines the classification of the window. Empirically, the proper size of the window can be chosen to yield virtually 100% classification accuracy for English and French in the tests we have performed.
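The testing procedure described in the abstract (apply a segment-level classifier repeatedly inside a larger window, then take a plurality vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `classify_window` and `toy_segment_classifier`, the `segment_len` and `hop` parameters, and the toy decision rule are all assumptions made for illustration, and the toy rule merely stands in for the trained neural network.

```python
from collections import Counter
from typing import Callable, Sequence


def classify_window(
    samples: Sequence[float],
    classify_segment: Callable[[Sequence[float]], str],
    segment_len: int,
    hop: int,
) -> str:
    """Label a window by plurality vote over fixed-length segments.

    `classify_segment` is a hypothetical stand-in for the trained network:
    it maps one fixed-length segment to a language label.
    """
    if segment_len <= 0 or hop <= 0:
        raise ValueError("segment_len and hop must be positive")
    if len(samples) < segment_len:
        raise ValueError("window is shorter than one segment")

    # Slide over the window and collect one vote per segment.
    votes = Counter(
        classify_segment(samples[start:start + segment_len])
        for start in range(0, len(samples) - segment_len + 1, hop)
    )
    # The most frequent label wins; on a tie, the label seen first wins.
    return votes.most_common(1)[0][0]


def toy_segment_classifier(segment: Sequence[float]) -> str:
    """Hypothetical rule used only to exercise the voting logic."""
    return "english" if sum(segment) > 0 else "french"


if __name__ == "__main__":
    # Three segments vote "english", two vote "french".
    window = [1.0] * 6 + [-1.0] * 4
    print(classify_window(window, toy_segment_classifier, segment_len=2, hop=2))
```

In this sketch a longer window yields more segment votes, which is one way to read the abstract's remark that the window size can be chosen to make the overall classification reliable.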


Proceedings of the Annual Meeting of the Cognitive Science Society
