- Magnuson, James S.;
- You, Heejo;
- Rueckl, Jay;
- Allopenna, Paul;
- Li, Monica;
- Luthra, Sahil;
- Steiner, Rachael;
- Nam, Hosung;
- Escabi, Monty;
- Brown, Kevin;
- Theodore, Rachel;
- Monto, Nicholas
Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), we experience phonetic constancy and typically perceive what a speaker intends. Models of human speech recognition have sidestepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, automatic speech recognition powered by deep learning networks has allowed robust, real-world speech recognition. However, the complexities of deep learning architectures and training regimens make it difficult to use them to provide direct insights into mechanisms that may support human speech recognition. We developed a simple network that borrows one element from automatic speech recognition (long short-term memory nodes, which provide dynamic memory for short and long spans). This allows the network to learn to map real speech from multiple talkers to semantic targets with high accuracy. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of human speech recognition.
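As an illustration only, and not the authors' implementation, the kind of architecture described above can be sketched as a single recurrent layer of long short-term memory units mapping spectrogram-like acoustic frames to a distributed semantic target. All layer sizes, names, and the training objective in the sketch below are assumptions chosen for concreteness.

```python
# Minimal sketch (illustrative, not the paper's code): an LSTM that maps
# spectrogram-like acoustic frames to a fixed semantic target vector.
# Dimensions and the loss function are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class SpeechToSemantics(nn.Module):
    def __init__(self, n_freq_channels=256, n_hidden=512, n_semantic_units=300):
        super().__init__()
        # A single LSTM layer provides dynamic memory for short and long spans
        # of the acoustic input.
        self.lstm = nn.LSTM(n_freq_channels, n_hidden, batch_first=True)
        # Linear readout to a distributed semantic representation at each frame.
        self.readout = nn.Linear(n_hidden, n_semantic_units)

    def forward(self, frames):  # frames: (batch, time, n_freq_channels)
        hidden, _ = self.lstm(frames)
        return torch.sigmoid(self.readout(hidden))  # (batch, time, n_semantic_units)

# Training would push the network's output toward the word's semantic vector
# at every frame, e.g. with a binary cross-entropy objective:
model = SpeechToSemantics()
frames = torch.randn(8, 100, 256)              # 8 utterances, 100 frames each
targets = (torch.rand(8, 300) > 0.95).float()  # sparse semantic vectors
loss = nn.BCELoss()(model(frames), targets.unsqueeze(1).expand(-1, 100, -1))
loss.backward()
```

Internal (hidden-layer) activations of such a network, probed over time, are the kind of representations the abstract describes as resembling phonetically organized responses in superior temporal gyrus.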