Temporal organization in vocal communication: sequential structure, perceptual integration, and neural foundations
Open Access Publications from the University of California

UC San Diego Electronic Theses and Dissertations


Our interactions with the world unfold over time. Whether it's speaking, where one word follows the next, or walking, where each step follows another, the organization of our behaviors in time tends to follow predictable patterns. Those patterns are shaped by a multitude of underlying factors, both endogenous physiological ones, like the rhythmic nature of our gait, and exogenous ones, like the social dynamics underlying turn-taking in conversation. Despite decades of research on the temporal organization of behavior, dating back to the work of influential biologists like Tinbergen, Lashley, and Dawkins, little is known about the physiological substrates that underlie the sequential organization of most aspects of behavior. For example, despite widespread acknowledgment that motor programs and many non-linguistic behaviors are hierarchically organized, few physiological investigations into the dynamics of behavior extend beyond low-order (Markovian) transition statistics.
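To make "low-order (Markovian) transition statistics" concrete, the sketch below estimates a first-order transition matrix from a toy syllable sequence; the sequence and function name are illustrative, not taken from the thesis.

```python
from collections import Counter, defaultdict

def transition_matrix(sequence):
    """Estimate first-order (Markov) transition probabilities from a
    sequence of discrete behavioral states: P(next state | current state)."""
    counts = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

# Toy song: syllable "A" is always followed by "B"; "B" goes to "A" or "C".
song = list("ABABCABABC")
P = transition_matrix(song)
```

A description at this order captures only which element tends to follow which; it is blind to any dependency reaching further back than one step, which is exactly the limitation the thesis addresses.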

In this thesis, I build onto the emerging field of computational neuroethology to further our understanding of what structure underlies the sequential organization of behavior, what physiological mechanisms might be involved in producing, perceiving, and representing sequential behavioral organization, and how sequential behavioral organization might have emerged developmentally and evolutionarily. Throughout the thesis, I draw primarily upon birdsong and human speech, developing methods to analyze the acoustic and temporal structure in vocal signals and then behaviorally and physiologically probing the underpinnings of sequential organization in the songbird. This work advances the field of computational neuroethology in several ways.I uncover novel acoustic structure in vocal signals separating avian and mammalian vocalizations along a spectrum of vocal stereotypy. I observe that both human speech and birdsong are characterized by a combination of long and short-range temporal patterning. I find that the long-range temporal patterning characterizing human speech, believed to be underlied by hierarchical linguistic organization, is present at the earliest developmental stages of human speech, well before complex syntax is produced. I find that the perceptual integration of birdsong syllable sequences can be well explained by Bayesian models of probabilistic perceptual decision-making. Finally, I find that sensory neural representations of syllable sequences are modulated by sequential context and that this modulation reflects the animals underlying perceptual behavior. In the following paragraphs, I give a brief overview of the methods and major results of the chapters comprising this thesis.

In Chapter \ref{chapter:review} I give an introduction to the emerging field of vocal computational neuroethology. This introduction contextualizes the following chapters in a review of current work. I emphasize current tools, challenges, and future directions in vocal neuroethology. I start with a discussion of low-level bioacoustics challenges and build up to a discussion of behavioral organization and physiology. I first discuss challenges in signal processing, such as dealing with noise and representing vocal signals in time-frequency space. I then discuss machine learning approaches used to identify, segment, and label vocalizations. Next, I discuss how to extract relational structure between vocalizations and cluster latent projections of vocalizations. I then give an overview of methods for capturing temporal relationships in vocal sequences, outlining traditional Markovian descriptions of vocal structure and new tools, enabled by large datasets, for capturing long-range structure. I then move on to machine learning tools that can be used to systematically control and synthesize vocal signals from learned vocal spaces. Finally, I discuss how these techniques are being utilized in several active areas of neuroethology research.
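A common entry point to the segmentation step mentioned above is amplitude thresholding. The sketch below, a simplification rather than the chapter's actual pipeline, marks contiguous supra-threshold runs of an amplitude envelope as candidate syllables; the envelope values and `min_len` heuristic are invented for illustration.

```python
def segment_syllables(envelope, threshold, min_len=2):
    """Segment a vocalization into syllables by thresholding its amplitude
    envelope: contiguous supra-threshold runs become candidate syllables;
    runs shorter than min_len samples are discarded as noise."""
    segments, start = [], None
    for i, a in enumerate(envelope):
        if a >= threshold and start is None:
            start = i                        # syllable onset
        elif a < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))  # (onset, offset) indices
            start = None
    if start is not None and len(envelope) - start >= min_len:
        segments.append((start, len(envelope)))
    return segments

# Toy envelope: two bursts of sound separated by silence.
env = [0, 0, 5, 6, 7, 0, 0, 8, 9, 8, 7, 0]
bouts = segment_syllables(env, threshold=1)
```

Real recordings require noise-robust envelopes and adaptive thresholds, which is part of why the chapter turns to machine learning approaches for identification and labeling.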

In Chapter \ref{chapter:avgn} I develop a set of methods to visualize and quantify relational structure in vocalizations, which enable the analyses and experiments performed in the following chapters. I use graph-based dimensionality reduction to uncover local structure in vocal communication signals and apply that technique to 19 datasets consisting of vocalizations from 29 species, including songbirds, primates, cetaceans, rodents, and bats. I observe that these methods uncover novel structure in animal vocal signals, including vocal dialects, acoustic units, behaviorally relevant signal information, and sub-syllabic structure.
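Graph-based dimensionality reduction methods such as the one used here begin by building a k-nearest-neighbour graph over the data (e.g. flattened syllable spectrograms) and then lay out a low-dimensional embedding that preserves that graph. The sketch below shows only that first, graph-construction step on toy 2-D points standing in for spectrograms; it is a conceptual illustration, not the chapter's implementation.

```python
import math

def knn_graph(points, k):
    """Build the k-nearest-neighbour graph that graph-based dimensionality
    reduction methods (e.g. UMAP) start from: each point is connected to
    its k closest neighbours under Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        graph[i] = sorted(j for _, j in sorted(others)[:k])
    return graph

# Toy "spectrograms" flattened to 2-D points: two well-separated clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
g = knn_graph(pts, k=2)
```

Because neighbours are found within clusters, the resulting graph (and hence the embedding) preserves exactly the kind of local structure, such as dialects and acoustic units, that the chapter reports uncovering.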

In Chapter \ref{chapter:parametric_umap}, I extend the methods from Chapter \ref{chapter:avgn} by introducing Parametric UMAP, a graph-based dimensionality reduction algorithm that parametrically learns the relationship between data (here, vocal signals) and latent embeddings. The learned parametric mapping enables the methods from Chapter \ref{chapter:avgn} to be applied in real-time, closed-loop settings and over larger datasets. I show that this algorithm has applications in semi-supervised settings and provides additional control over the trade-off between capturing global and local structure in embeddings.
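The core idea of a parametric embedding, as opposed to a nonparametric one, can be illustrated in miniature: fit a parametric function to map data to embedding coordinates so that new samples can be embedded in constant time without re-running the full algorithm. The sketch below uses a one-parameter linear map fit by gradient descent as a stand-in for the neural network Parametric UMAP trains; all data here are invented.

```python
def fit_parametric_embedder(xs, ys, lr=0.01, steps=2000):
    """Learn a parametric map f(x) = w*x + b approximating a set of
    precomputed (nonparametric) embedding coordinates ys. Once trained,
    f embeds new data in constant time -- the idea behind Parametric
    UMAP, where f is a neural network rather than a line."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error between f(x) and target embedding.
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw
        b -= lr * db
    return lambda x: w * x + b

# Toy targets: embedding coordinates lie on y = 2x - 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1.0, 1.0, 3.0, 5.0]
f = fit_parametric_embedder(xs, ys)
```

The constant-time forward pass is what makes closed-loop, real-time use possible: an incoming vocalization can be embedded as fast as the network can be evaluated.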

In Chapter \ref{chapter:parallels} I explore the long- and short-range temporal patterning of vocal sequences in birdsong and human speech. I use an information-theoretic framework to analyze statistical dependencies as a function of the distance between elements in vocal sequences. I find that both birdsong and human speech exhibit two forms of structure: short-range relationships captured by Markovian dynamics over short timescales, and long-range relationships that decay following a power law over longer timescales. In language, the observed short-range organization conforms to phonological processes, which are well described by finite-state dynamics, while the long-range organization suggests more complex dynamics, such as underlying hierarchical organization. Previous analyses of birdsong have identified only short-range Markovian dynamics, making our observation of long-range dynamics in birdsong novel.
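The information-theoretic statistic in question is the mutual information between sequence elements separated by a distance d; how it decays as d grows (exponentially for Markovian dynamics, as a power law for richer structure) is the diagnostic. The sketch below computes a plug-in estimate of this quantity on a toy alternating sequence; it is a bare-bones illustration, without the bias corrections a real analysis needs.

```python
import math
from collections import Counter

def mutual_information(seq, d):
    """Plug-in estimate of the mutual information (in bits) between
    elements separated by distance d in a symbol sequence."""
    pairs = list(zip(seq, seq[d:]))
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    n = len(pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi

# A strictly alternating sequence is perfectly predictable one step
# ahead, so the mutual information at distance 1 is exactly 1 bit.
seq = list("ABABABABABABA")
```

Plotting this estimate against d for real corpora, on log-log and log-linear axes, is what distinguishes the power-law decay reported for speech and birdsong from the exponential decay a Markov model would produce.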

In Chapter \ref{chapter:lri} I extend the human-speech analysis from Chapter \ref{chapter:parallels} to language acquisition. By analyzing corpora of speech throughout language development, we can observe the time course of the emergence of long- and short-range relationships. Surprisingly, I find that long-range statistical dependencies are present in children's speech as early as 6-12 months, well before complex syntactic structure is present. I discuss these results alongside emerging evidence from computational ethology that long-range relationships are also common to non-linguistic behavioral signals from animals as diverse as zebrafish, Drosophila, and whales. Although previous analyses have suggested that long-range relationships are the product of hierarchical linguistic structure such as syntax and discourse structure, our observations in developmental speech and non-linguistic behaviors suggest that other mechanisms may also be at play.

Finally, in Chapter \ref{chapter:cdcp} I probe how sequential dependencies in vocal sequences are integrated behaviorally and physiologically. I develop a behavioral task in which European starlings are trained to classify morphs of starling song syllables synthesized from an interpolation between two points in the latent space of a neural network (a Variational Autoencoder). These morph syllables are preceded by a separate syllable (a cue syllable), which holds predictive information about the category of the following morph syllable. I find that classification of the morph syllable is contextually modulated by the predictive probability of the cue syllable, which can be well explained by a model of Bayesian integration. With the same behavioral paradigm, I then record chronic electrophysiology data from auditory nuclei while birds perform this context-dependent categorical perceptual decision-making task. I find that neural activity patterns reflect several aspects of our model of perceptual behavior, including the uncertainty in decision-making and prediction-related perceptual modulation.
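The logic of the Bayesian-integration account can be stated in a few lines: the cue syllable sets a prior over the two categories, the morph syllable supplies the likelihood, and the reported category follows the posterior. The sketch below is a toy two-category version of that model class, not the fitted model from the chapter; the probabilities are invented.

```python
def posterior_category_A(likelihood_A, prior_A):
    """Bayes' rule for a two-category decision: combine the morph
    syllable's likelihood of category A with the cue-induced prior."""
    evidence = likelihood_A * prior_A + (1 - likelihood_A) * (1 - prior_A)
    return likelihood_A * prior_A / evidence

# A fully ambiguous morph (likelihood 0.5) is pulled to whatever the
# cue predicts; an informative morph under a neutral cue is unchanged.
p_ambiguous = posterior_category_A(0.5, 0.8)
p_neutral_cue = posterior_category_A(0.7, 0.5)
```

This captures the behavioral signature reported above: contextual modulation is strongest for ambiguous morphs near the category boundary, where the likelihood is uninformative and the cue-driven prior dominates the posterior.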
