Deep Learning of Discriminative Gene Expression Models and Generative Models of Molecule and Protein Sequences
- Honig, Edouardo
- Advisor(s): Wu, Yingnian
Abstract
Deep learning has enabled the creation of systems that appear more capable than humans across a myriad of domains, including understanding and generation of language, vision, games, finance, and social and physical sciences. While there exists a baseline level of human knowledge across all domains, there remains a large gap in the scientific understanding of the biological sequences that make up the building blocks of life. In this dissertation, we propose novel approaches to modeling genomic, molecular, and protein sequences to better decipher and construct these sequences. This effort aims to benefit the production of novel therapies and drugs to treat diseases. Specifically, we leverage and improve deep learning techniques to model the human genome, drug-like molecules, and proteins.
This dissertation is the culmination of four works. The first two focus on predicting single-cell gene expression from the human reference genome. By using annotated representations of genetic sequences, we develop mathematical equations to inform scientific understanding of the contribution of various sections of the genome to gene expression. Additionally, we model long-range genetic effects by extending the length of the genetic sequence input to our model, and confirm that the model's intermediate representations are consistent with our understanding of the genome. The third work develops a framework for conditional generation of molecules by iterative refinement, which may improve the discovery and development of drugs. Finally, we build an efficient model for protein sequences, aiming to bolster the production of useful proteins for industrial and medical applications.