eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Application of Information Theory to Modeling Exploration and Detecting Protein Coevolution

Abstract

In this thesis I introduce novel applications of information theory to two fundamental problems: the modeling of learning-driven exploration and the identification of coevolving protein residues. The two studies share a common approach in their use of information-theoretic constructs, and each represents a significant contribution to its respective field.

Discovering the structure underlying observed data is a recurring problem in machine learning with important applications in neuroscience. It is also a primary function of the brain. When data can be actively collected in the context of a closed action-perception loop, behavior becomes a critical determinant of learning efficiency. Psychologists studying exploration and curiosity in humans and animals have long argued that learning itself is a primary motivator of behavior. However, the theoretical basis of learning-driven behavior is not well understood. Previous computational studies of behavior have largely focused on the control problem of maximizing acquisition of rewards and have treated learning the structure of data as a secondary objective. Here, I study exploration in the absence of external reward feedback. Instead, I take the quality of an agent's learned internal model to be the primary objective. In a simple probabilistic framework, I derive a Bayesian estimate for the amount of information about the environment an agent can expect to receive by taking an action, a measure I term the predicted information gain (PIG). I develop exploration strategies that approximately maximize PIG. One strategy, based on value iteration, consistently learns faster across a diverse range of environments than previously developed reward-free exploration strategies. Psychologists believe the evolutionary advantage of learning-driven exploration lies in the generalized utility of an accurate internal model. Consistent with this hypothesis, I demonstrate that agents that learn more efficiently during exploration are later better able to accomplish a range of goal-directed tasks. I conclude by discussing how this work elucidates the explorative behaviors of animals and humans, its relationship to other computational models of behavior, and its potential application to experimental design, such as in closed-loop neurophysiology studies.
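
The abstract does not state the PIG formula, so the following is only a minimal sketch of one way such a quantity could be computed. It assumes a discrete environment, a symmetric Dirichlet prior over next-state probabilities for a single state-action pair, and an information gain measured as the KL divergence from the current estimate to the estimate after one more observation. The function names and the `alpha` parameter are hypothetical and chosen for illustration.

```python
from math import log2


def posterior_mean(counts, alpha=1.0):
    """Posterior-mean next-state distribution under a symmetric Dirichlet prior.

    counts[i] is how often next state i was observed for one state-action pair.
    alpha is an assumed prior pseudo-count (must be > 0).
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; q must be strictly positive."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def predicted_information_gain(counts, alpha=1.0):
    """Expected KL divergence between the updated and the current estimate.

    The expectation over the next observation is taken under the current
    estimate, since the true distribution is unknown to the agent.
    """
    current = posterior_mean(counts, alpha)
    gain = 0.0
    for s_next, prob in enumerate(current):
        hypothetical = list(counts)
        hypothetical[s_next] += 1  # pretend we observed s_next once more
        updated = posterior_mean(hypothetical, alpha)
        gain += prob * kl_divergence(updated, current)
    return gain
```

Under these assumptions, a state-action pair that has never been tried, such as `predicted_information_gain([0, 0])`, scores higher than one that has been sampled many times, such as `predicted_information_gain([50, 50])`, which is the behavior an exploration strategy maximizing this quantity would rely on.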

The structure and function of a protein depend on coordinated interactions between its residues. The selective pressures associated with a mutation at one site should therefore depend on the amino acid identity of interacting sites. Mutual information has previously been applied to multiple sequence alignments as a means of detecting coevolutionary interactions. Here, I introduce a refinement of the mutual information method that 1) removes a significant, non-coevolutionary bias and 2) accounts for heteroscedasticity. Using a large, non-overlapping database of protein alignments, I demonstrate that predicted coevolving residue pairs tend to lie in close physical proximity. I introduce coevolution potentials as a novel measure of the propensity for the 20 amino acids to pair amongst predicted coevolutionary interactions. Ionic, hydrogen, and disulfide bond-forming pairs exhibit the highest potentials. Finally, I demonstrate that pairs of catalytic residues have a significantly increased likelihood of being identified as coevolving. These correlations with distinct protein features verify the accuracy of the algorithm and are consistent with a model of coevolution in which selective pressures towards preserving residue interactions shape the mutational landscape of a protein by restricting the set of admissible neutral mutations.
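
As a point of reference, the baseline quantity that the refinement builds on can be sketched as the plain mutual information between two columns of a multiple sequence alignment. This sketch covers only the unrefined measure; the bias removal and heteroscedasticity correction described above are not specified in the abstract and are not attempted here. The function name and the string representation of alignment columns are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b are equal-length sequences of residue symbols, one entry
    per aligned sequence. Probabilities are plain empirical frequencies.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must come from the same alignment")
    n = len(col_a)
    if n == 0:
        raise ValueError("columns must not be empty")

    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))

    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c * n / (count_a * count_b)
        mi += p_ab * log2(c * n / (count_a[a] * count_b[b]))
    return mi
```

For two columns that vary in lockstep, such as `"AACC"` and `"DDEE"`, this returns 1 bit, while for columns that vary independently, such as `"AACC"` and `"DEDE"`, it returns 0.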
