The high-level goals of this thesis are to understand the neural representation of sound, to produce more robust statistical models of natural sound, and to develop models of top-down auditory attention. These three concepts are central to the auditory system. The neural representation of sound should provide a useful substrate for building robust statistical models and for directing attention. Robust statistical models are necessary for humans to generalize their knowledge from one domain to the plethora of domains in the real world. And attention is fundamental to the perception of sound, allowing one to prioritize information in the raw audio signal.
First, I approach the neural representation of sound using the efficient coding principle and the physiological characteristics of the cochlea. A theoretical model is developed using convolutional filters and leaky integrate-and-fire (LIF) neurons to model the cochlear transform and the spiking code of the auditory nerve. The immediate goal of this model is to explain the distributed phase code of the auditory nerve response, but it lays the foundation for much more.
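To make the model's structure concrete, the following is a minimal sketch of its two stages: a bank of convolutional filters approximating the cochlear transform, each feeding LIF neurons that convert the filtered signal into spikes. The time step, membrane time constant, rectification, and input gain are illustrative assumptions, not fitted values.

```python
import numpy as np

def lif_response(drive, dt=1e-4, tau=5e-3, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire neuron driven by a filtered sound signal.

    drive: 1-D array of input drive (e.g., the output of one cochlear filter).
    Returns a binary spike train of the same length.
    """
    v = 0.0
    spikes = np.zeros_like(drive)
    for t, i_t in enumerate(drive):
        # Euler step of leaky integration of the input drive.
        v += dt * (-v / tau + i_t)
        if v >= v_thresh:      # threshold crossing -> emit a spike
            spikes[t] = 1.0
            v = v_reset        # reset the membrane potential
    return spikes

def cochlea(sound, filterbank, gain=50.0, dt=1e-4):
    """Cochlear transform as a bank of convolutional (e.g., gammatone-like)
    filters, each followed by rectification and an LIF spiking stage.
    The rectification and gain are illustrative placeholders."""
    drives = [np.convolve(sound, f, mode="same") for f in filterbank]
    return np.stack([lif_response(np.maximum(d, 0.0) * gain, dt=dt)
                     for d in drives])
```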
Second, I investigate an algorithm for audio source separation called deep clustering. Experiments are performed to evaluate its robustness, and a new neural network architecture is developed to improve it. The experiments show that the conventional recurrent neural network performs sub-optimally, while our dilated convolutional neural network improves robustness using an order of magnitude fewer parameters. This more parsimonious model is a step towards models that are minimally parameterized and generalize well across many domains.
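As an illustration of the architectural idea, here is a minimal sketch of a dilated convolutional embedding network of the kind used in deep clustering, where exponentially growing dilations give a large temporal receptive field with few parameters. The layer count, channel widths, and embedding dimension are illustrative placeholders, not the configuration evaluated in the experiments.

```python
import torch
import torch.nn as nn

class DilatedEmbedding(nn.Module):
    """Sketch of a dilated convolutional embedding network for deep clustering.

    Maps a (batch, freq, time) spectrogram to a unit-norm embedding per
    time-frequency bin, which is then clustered to assign bins to sources.
    """
    def __init__(self, n_freq=129, channels=64, emb_dim=20, n_layers=6):
        super().__init__()
        layers, in_ch = [], n_freq
        for i in range(n_layers):
            d = 2 ** i  # exponentially growing dilation -> wide receptive field
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=d, padding=d),
                       nn.ReLU()]
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_freq * emb_dim, kernel_size=1)
        self.n_freq, self.emb_dim = n_freq, emb_dim

    def forward(self, x):                 # x: (batch, freq, time)
        h = self.head(self.body(x))       # (batch, freq * emb_dim, time)
        b, _, t = h.shape
        v = h.view(b, self.n_freq, self.emb_dim, t)
        # Normalize so each time-frequency bin's embedding has unit length.
        return v / (v.norm(dim=2, keepdim=True) + 1e-8)
```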
Third, I develop a new algorithm to address the limitations of the deep clustering method. This algorithm can extract multiple sources at once from a mixture using an attentional context, or bias. It relies on modulating the computation of the bottom-up pathway with a top-down neural signal that indicates which sources are of interest. This is done with a simple idea borrowed from the attentional spotlight model: the top-down neural signal modulates the gain of a set of low-level neurons. This computational method demonstrates one way top-down feedback could direct auditory attention in the brain. Interestingly, the method goes beyond neuroscience: it demonstrates that attention can be about more than efficient computation. The experiments show that it resolves one of the main shortcomings of deep clustering: unlike deep clustering, the model can extract sources from a mixture without knowing the total number of sources in the mixture.
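In its simplest form, the gain modulation can be sketched as follows: a context vector encoding the source of interest is mapped to per-channel gains that multiply the bottom-up features. The module and variable names, the layer sizes, and the sigmoid squashing of the gains are illustrative assumptions rather than the exact formulation used in the model.

```python
import torch
import torch.nn as nn

class GainModulation(nn.Module):
    """Top-down gain control on low-level features (attentional spotlight).

    A context vector describing the source of interest is mapped to a
    per-channel gain that multiplies the bottom-up features.
    """
    def __init__(self, ctx_dim, n_channels):
        super().__init__()
        self.to_gain = nn.Linear(ctx_dim, n_channels)

    def forward(self, features, context):
        # features: (batch, channels, time); context: (batch, ctx_dim)
        gain = torch.sigmoid(self.to_gain(context))  # gains in (0, 1)
        return features * gain.unsqueeze(-1)         # modulate each channel

# Example: bottom-up features modulated by a top-down context vector.
mod = GainModulation(ctx_dim=32, n_channels=64)
out = mod(torch.randn(8, 64, 100), torch.randn(8, 32))
```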
The major contributions of this work are a theoretical model of the auditory nerve response, a more robust neural network architecture for sound understanding, and a novel and powerful model of top-down auditory attention. I hope that the first will be used to build a better understanding of the complex auditory nerve code; the second, to build ever more parsimonious and robust models of source separation; and the third, to provide a basis for an under-explored research direction that I believe is the most fruitful for building human-level auditory scene analysis: attention-based source separation.