Over the past few decades, engineering and statistical principles have played a significant role in the advancement of biology and medicine. In this dissertation, we study two problems which have applications in bioinformatics and epidemiology, respectively. We look at these problems through the lens of statistical inference, and propose exact and approximate algorithms.
The first line of work is on sequence reconstruction over deletion channels, which lies at the intersection of theoretical computer science, information theory and bioinformatics. Deletion channels have become increasingly relevant in bioinformatics to model impairments in DNA sequencing and DNA data storage. Prior works in theoretical computer science have proposed sequence alignment heuristics for deletion impairments. We take an alternate approach to this problem and propose algorithms that are rooted in statistical principles, such as maximum-likelihood decoding and maximum-aposteriori decoding. Our algorithms outperform the state-of-the-art algorithms in this space. Along the way, we also develop mathematical tools that are of general interest beyond the study of deletion channels.
The second line of work is on group testing which is related to information theory, medical testing and epidemiology. Group testing is a technique that pools together diagnostic samples from multiple individuals to identify infected individuals in a population. Such a technique potentially allows us to use fewer tests than what is required to test everyone individually. Here again, we improve upon the state-of-the-art group testing algorithms by developing a new statistical inference approach to this problem. Moreover, while traditional group testing assumes independent infections, we observe that contagious diseases like COVID-19 are governed by community spread. We show that taking into account the side information provided by the community structure may lead to significant savings – upto 60% fewer tests compared to traditional test designs in our experiments. We develop new bounds and new approaches to encoding and decoding algorithms that take into account the community structure and integrate group testing into epidemiological modeling.