Accounting for the phonetic value of nonspeech sounds
- Author: Finley, Gregory Peter
- Advisor: Johnson, Keith A.
The nature of the process by which listeners parse auditory inputs into the phonetic percepts necessary for speech understanding is still only partially understood. Different theoretical stances frame the process either as the action of ordinary auditory processes or as the workings of a specialized speech perception system or module. Evidence that speech perception is special, at least on some level, can be found in perceptual phenomena that are associated with speech processing but not observed with other auditory stimuli. These include effects known to be related to top-down linguistic influence or even to the listener’s parsing of the speaker’s articulatory gestures.
There is mounting evidence, however, that these phenomena are not always restricted to speech stimuli: some nonspeech sounds, under certain presentation conditions, participate in these phonetic processes as well. These findings are enormously relevant to the theory of speech perception, as they suggest that a sharp speech/nonspeech dichotomy is untenable. More promisingly, they offer a way of reverse-engineering those aspects of speech perception that do not have a simple psychophysical explanation, by observing how those aspects react to stimuli that are carefully controlled and may even be missing elements that are always present in speech. Experimental work that has attempted to do so is reviewed and discussed.
Original work extending these findings for two types of nonspeech stimuli is also presented. In the first set of experiments, compensation for coarticulation is tested on a speech fricative target with a nonspeech context vowel (a synthesized glottal source with a single formant resonance). Results show that this nonspeech context does induce a reliable context effect, one which cannot be due to auditory contrast. The effect is weaker than that induced by speech vowels, suggesting that listeners apply phonetic processing to a degree influenced by the plausibility of the acoustic event as speech.
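The nonspeech context described above (a glottal-like source passed through a single resonance) can be sketched in a few lines. This is a minimal illustration, not the dissertation's actual synthesis procedure: the sample rate, fundamental frequency, formant bandwidth, and sawtooth source shape are all assumed values chosen for clarity.

```python
import numpy as np
from scipy.signal import lfilter

def single_formant_vowel(formant_hz, dur=0.3, f0=120.0, fs=16000, bw=80.0):
    """Crude glottal-like source (sawtooth at f0) filtered through a
    single two-pole resonance centred on formant_hz.
    All parameter defaults are illustrative assumptions."""
    n = int(dur * fs)
    t = np.arange(n) / fs
    # sawtooth approximates the harmonic-rich spectrum of a glottal source
    source = 2.0 * (t * f0 % 1.0) - 1.0
    # two-pole resonator: pole radius from bandwidth, angle from frequency
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * formant_hz / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1.0 - r]  # rough gain normalization
    return lfilter(b, a, source)

# a "vowel-like" nonspeech sound with a single resonance near 500 Hz
sig = single_formant_vowel(500.0)
```

The resonator concentrates energy around the nominal formant, so the strongest harmonic of the output lies near 500 Hz even though the source itself is strongest at the fundamental.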
In the second set, listeners matched frequency-modulated tones to time-aligned visual CV syllables in which rounding on the consonant and vowel varied independently. Results are consistent with those obtained in previous experiments with non-modulated tones: high tones are paired with high front vowel articulation, low tones with (back) rounded articulation. It is shown that this pitch-vowel correspondence extends to contexts that include spectrotemporal modulation at rates similar to speech. These findings support treating this effect as a product of ordinary speech production rather than an unexplained idiosyncrasy in the auditory system.
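A frequency-modulated tone of the general kind described above can be generated by integrating an instantaneous-frequency trajectory into phase. The linear trajectory, duration, sample rate, and endpoint frequencies below are illustrative assumptions, not the stimuli actually used in the experiments.

```python
import numpy as np

def fm_glide(f_start, f_end, dur=0.3, fs=16000):
    """Pure tone whose instantaneous frequency moves linearly from
    f_start to f_end; phase is the cumulative integral of frequency."""
    n = int(dur * fs)
    freq = np.linspace(f_start, f_end, n)      # instantaneous frequency (Hz)
    phase = 2 * np.pi * np.cumsum(freq) / fs   # integrate frequency -> phase
    return np.sin(phase)

# hypothetical "high" and "low" glides, in the spirit of the pairing result
high = fm_glide(900.0, 1100.0)   # tends to pair with high front articulation
low = fm_glide(500.0, 300.0)     # tends to pair with (back) rounded articulation
```

Integrating frequency rather than multiplying `2*pi*freq*t` directly keeps the phase continuous, so the glide is free of the spurious modulation artifacts a naive formula would introduce.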
The correspondences between nonspeech and speech sounds, as reviewed and as noted in the above experiments, were further evaluated on a spectral level. Much research has modeled how listeners categorize speech spectra, and some of it has identified certain cues as critical to phonetic categorization. Several of these models are evaluated here on nonspeech sounds: a processing strategy that genuinely resembles human processing should predict the same phonetic categorizations that human listeners make, even for nonspeech. A comparison of full-spectrum versus formant-based models shows that the former much more accurately capture human judgments on the vowel quality of pure tones, and are also fairly effective at classifying formant-derived sine wave speech. Derived spectral measures, such as formants and cepstra, are well tuned for speech but generally unable to imitate human performance on nonspeech.
All of these experiments support the notion that phonetic categorization for vowels and similar sounds operates by comparing spectral templates rather than highly derived spectral features such as formants. The observed correspondences between speech and nonspeech can be explained by spectral similarity, depending on both the presence and absence of spectral energy. More generally, the results support an inference-based understanding of speech perception in which listeners categorize based on maximizing the likelihood of an uttered phone given auditory input and scene analysis.