Songbirds are widely studied as an animal model to accelerate the development of neurally driven speech prostheses. Analyses carried out on the songbird model depend on high-quality labeled datasets, and building such datasets is laborious: vocal behavior must be annotated manually before neural activity can be decoded into vocalizations. Moreover, the direct translation of neural activity into vocal behavior is challenging, since songbird vocalizations are typically recorded at high sampling rates to capture rapid changes in the behavior.
In this thesis, these problems are addressed using data-driven approaches. To reduce the effort involved in manual annotation, deep learning models are explored for automatic labeling of vocalizations. TweetyNet, a model based on convolutional and recurrent layers, is trained on a small amount of manually labeled data comprising features derived from audio. It is shown to achieve high frame-level sensitivity and high temporal precision in annotating the vocalizations of adult male zebra finches whose recordings were collected in-house. Alternatively, a WaveNet-based fully convolutional model trained directly on audio is also shown to provide high temporal precision in annotations.
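To make the frame-labeling approach concrete, the following is a minimal PyTorch sketch of a convolutional-recurrent frame classifier in the spirit of TweetyNet. It is an illustration under stated assumptions, not the published architecture: the layer sizes, pooling scheme, number of syllable classes, and input dimensions are placeholders.

    import torch
    import torch.nn as nn

    class FrameLabeler(nn.Module):
        """Conv + recurrent frame classifier in the spirit of TweetyNet.
        Hyperparameters are illustrative, not the published configuration."""
        def __init__(self, n_freq_bins=128, n_classes=10, hidden=64):
            super().__init__()
            # 2-D convolutions over the (freq, time) spectrogram; pooling
            # only along frequency so time resolution is preserved.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(4, 1)),
                nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(4, 1)),
            )
            feat = 64 * (n_freq_bins // 16)
            self.rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_classes)  # one label per frame

        def forward(self, spec):                 # spec: (batch, 1, n_freq_bins, n_frames)
            x = self.conv(spec)                  # (batch, 64, n_freq_bins // 16, n_frames)
            x = x.flatten(1, 2).transpose(1, 2)  # (batch, n_frames, feat)
            x, _ = self.rnn(x)
            return self.head(x)                  # (batch, n_frames, n_classes) logits

    # Per-frame cross-entropy against manually labeled syllable classes:
    model = FrameLabeler()
    spec = torch.randn(8, 1, 128, 1000)          # batch of spectrograms
    labels = torch.randint(0, 10, (8, 1000))     # frame-level annotations
    loss = nn.CrossEntropyLoss()(model(spec).permute(0, 2, 1), labels)

Pooling only along the frequency axis means every spectrogram frame receives its own label, which is what makes frame-level sensitivity and temporal precision meaningful evaluation measures.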
To reduce the complexity of directly translating neural activity into behavior, an intermediate stage is introduced into the neural decoding pipeline. This stage encodes a low-dimensional representation of the behavior from which vocalizations can be reconstructed. A Vector Quantized Variational Autoencoder (VQ-VAE) is trained to learn latent representations of zebra finch vocalizations. Additionally, novel stimuli are generated from these latent representations for use in psychophysical experiments.
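The sketch below illustrates the core of such a model: an encoder compresses spectrogram frames into a latent sequence, each latent vector is snapped to its nearest codebook entry with a straight-through gradient, and a decoder reconstructs the input. This is a generic VQ-VAE sketch with illustrative assumptions (codebook size, latent width, 1-D convolutional encoder and decoder), not the exact model used in the thesis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        """Nearest-neighbor codebook lookup with a straight-through gradient."""
        def __init__(self, n_codes=256, dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(n_codes, dim)
            self.beta = beta                         # commitment-loss weight

        def forward(self, z_e):                      # z_e: (batch, T, dim)
            dist = torch.cdist(z_e, self.codebook.weight.unsqueeze(0))
            idx = dist.argmin(dim=-1)                # discrete code per latent frame
            z_q = self.codebook(idx)
            # Pull codebook entries toward encoder outputs, and vice versa.
            vq_loss = (F.mse_loss(z_q, z_e.detach())
                       + self.beta * F.mse_loss(z_e, z_q.detach()))
            z_q = z_e + (z_q - z_e).detach()         # straight-through estimator
            return z_q, idx, vq_loss

    class VQVAE(nn.Module):
        """1-D convolutional encoder/decoder over spectrogram frames."""
        def __init__(self, n_freq_bins=128, dim=64):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv1d(n_freq_bins, dim, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv1d(dim, dim, 4, stride=2, padding=1))
            self.vq = VectorQuantizer(dim=dim)
            self.dec = nn.Sequential(
                nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose1d(dim, n_freq_bins, 4, stride=2, padding=1))

        def forward(self, spec):                     # spec: (batch, n_freq_bins, T)
            z_e = self.enc(spec).transpose(1, 2)     # (batch, T/4, dim)
            z_q, idx, vq_loss = self.vq(z_e)
            recon = self.dec(z_q.transpose(1, 2))    # back to (batch, n_freq_bins, T)
            return recon, idx, vq_loss

    model = VQVAE()
    spec = torch.randn(8, 128, 1024)                 # batch of vocalization spectrograms
    recon, codes, vq_loss = model(spec)
    loss = F.mse_loss(recon, spec) + vq_loss         # reconstruction + VQ objective

The discrete code sequence (codes) is the low-dimensional representation; sampling or perturbing such codes and passing them through the decoder is one way novel stimuli can be generated from the latent space.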
While the stereotyped behavior of songbirds is widely studied, for practical vocal prostheses it is equally important to decode non-stereotyped behaviors such as calls, which can be of multiple types. Different call types in zebra finches are identified using Uniform Manifold Approximation and Projection (UMAP), and spike counts are extracted from the corresponding neural activity. Gaussian classification of these spike counts is shown to achieve high accuracy, corroborating the mutual information between spike counts and call type.
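A hypothetical end-to-end version of this analysis, using the umap-learn and scikit-learn packages, might look as follows. The synthetic placeholder data, the HDBSCAN clustering step used to assign call types in the UMAP embedding, and the choice of Gaussian naive Bayes as the Gaussian classifier are all assumptions made for this sketch.

    import numpy as np
    import umap                                   # umap-learn package
    from sklearn.cluster import HDBSCAN           # scikit-learn >= 1.3
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Placeholder data with built-in structure so the sketch runs end to end:
    # three call types with distinct spectral templates and firing rates.
    rng = np.random.default_rng(0)
    n, types = 600, 3
    true_type = rng.integers(0, types, size=n)
    templates = rng.normal(size=(types, 4096))
    call_specs = templates[true_type] + 0.5 * rng.normal(size=(n, 4096))
    rates = rng.uniform(1.0, 10.0, size=(types, 32))
    spike_counts = rng.poisson(rates[true_type]).astype(float)  # (n_calls, n_neurons)

    # 1) Embed the call spectrograms with UMAP and cluster to assign call types.
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(call_specs)
    call_type = HDBSCAN(min_cluster_size=15).fit_predict(embedding)
    keep = call_type >= 0                         # drop points HDBSCAN marks as noise

    # 2) Gaussian classification of call type from spike counts.
    clf = GaussianNB()
    acc = cross_val_score(clf, spike_counts[keep], call_type[keep], cv=5)
    print(f"mean cross-validated accuracy: {acc.mean():.2f}")

High cross-validated accuracy here plays the same role as in the thesis: it indicates that spike counts carry substantial information about call type.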
Together, these techniques and analyses are expected to provide insights for the development of songbird vocal prostheses.