Addressing challenges for population genetic inference from next-generation sequencing
- Author(s): Han, Eunjung
- Advisor(s): Novembre, John
- Sinsheimer, Janet S
- et al.
Next-generation sequencing (NGS) data provides tremendous opportunities for making new discoveries in biology and medicine. However, the structure of NGS data poses many inherent challenges: for example, reads have high error rates, read mapping is sometimes uncertain, and coverage is variable and in many cases low or completely absent. These challenges make accurate individual-level genotype calls difficult and make downstream analysis based on genotypes problematic if genotype uncertainty is not accounted for. In this dissertation, I present recent work addressing challenges that arise in the analysis of NGS data for population genetic inference and provide recommendations and guidelines for interpreting such data with precision. Throughout this dissertation, I focus on estimating the site frequency spectrum (SFS). The distribution of allele frequencies across polymorphic sites, also known as the SFS, is of primary interest in population genetics. It is a complete summary of sequence variation at unlinked sites and, more generally, its shape reflects underlying population genetic processes.
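To make the definition of the SFS concrete, the following minimal sketch tabulates it from already-called genotypes. The function name `site_frequency_spectrum` and the input layout (one list of per-individual derived-allele counts per site, for diploid individuals) are assumptions made for illustration and are not taken from the dissertation.

```python
def site_frequency_spectrum(genotypes, n_chromosomes):
    """Count sites by derived-allele count.

    genotypes: list of sites; each site is a list with one entry per
    diploid individual giving its number of derived alleles (0, 1 or 2).
    n_chromosomes: total number of sampled chromosomes (2 * individuals).
    Returns a list where entry k is the number of sites at which exactly
    k chromosomes carry the derived allele.
    """
    sfs = [0] * (n_chromosomes + 1)
    for site in genotypes:
        sfs[sum(site)] += 1
    return sfs


# Three diploid individuals (six chromosomes), five sites.
example_genotypes = [
    [0, 1, 0],  # singleton
    [1, 1, 0],  # doubleton
    [0, 0, 0],  # monomorphic for the ancestral allele
    [2, 2, 2],  # fixed for the derived allele
    [0, 0, 1],  # singleton
]
print(site_frequency_spectrum(example_genotypes, 6))
```

Entries 0 and `n_chromosomes` count monomorphic sites; restricting to polymorphic sites, as in the definition above, means reading only the entries in between.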
First, I characterize biases that can arise when inferring the SFS from low- to medium-coverage sequencing data and present a statistical method that can ameliorate such biases. I compare two approaches to estimating the SFS from sequencing data: one approach infers individual genotypes from aligned sequencing reads and then estimates the SFS based on the inferred genotypes (call-based approach), and the other approach directly estimates the SFS from aligned sequencing reads by maximum likelihood (direct estimation approach). I find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS estimated by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline used to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) leads to overestimation of rare variants. I characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. This work highlights that, depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference with the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.
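The direct estimation approach can be illustrated with a generic EM update for mixture proportions. This is a sketch under stated assumptions, not the dissertation's implementation: it assumes a per-site likelihood vector (the likelihood of the reads given each possible derived-allele count) has already been computed, and the name `em_sfs` and the fixed iteration count are hypothetical.

```python
def em_sfs(site_likelihoods, n_iter=100):
    """Maximum-likelihood SFS from per-site likelihood vectors via EM.

    site_likelihoods: list of sites; each site is a list whose entry k is
    the likelihood of that site's reads given derived-allele count k.
    Returns the estimated proportion of sites in each allele-count class.
    """
    n_classes = len(site_likelihoods[0])
    n_sites = len(site_likelihoods)
    sfs = [1.0 / n_classes] * n_classes  # start from a uniform spectrum
    for _ in range(n_iter):
        expected = [0.0] * n_classes
        for lik in site_likelihoods:
            # E-step: posterior over allele counts for this site.
            weighted = [l * p for l, p in zip(lik, sfs)]
            total = sum(weighted)
            for k, w in enumerate(weighted):
                expected[k] += w / total
        # M-step: new proportions are the average posteriors.
        sfs = [e / n_sites for e in expected]
    return sfs
```

No site is ever assigned a hard genotype or allele count here; each site contributes its full posterior, which is how this style of estimator carries genotype uncertainty through to the SFS.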
Next, I describe the development of a novel algorithm that can speed up the existing direct estimation method with EM optimization. The existing method directly estimates the SFS from sequencing data by first computing site likelihood vectors (i.e., the likelihood that a site has each possible allele frequency, conditional on the observed sequence reads) using a dynamic programming (DP) algorithm. Although this method produces an accurate SFS, the cost of computing the site likelihood vector is quadratic in the number of samples sequenced. To overcome this computational challenge, I propose an algorithm, which we call the adaptive K-restricted algorithm, that computes the site likelihood vector in time linear in the number of genomes. This algorithm works because, in the lower triangular matrix that arises in the DP algorithm, all non-negligible values of the site likelihood vector are concentrated in a few cells around the best-guess allele counts. I show that this adaptive K-restricted algorithm has comparable accuracy but is faster than the original DP algorithm. This speed improvement makes SFS estimation practical when using low-coverage NGS data from a large number of individuals.
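The idea of restricting the DP to cells near the best-guess allele count can be sketched as follows. This is an illustration only: the recursion is a standard one for combining per-individual genotype likelihoods of diploids, while the banded variant uses a fixed half-width `K` around a running best-guess count, which is a simplifying assumption and not the adaptive rule of the dissertation. The function names are hypothetical.

```python
from math import comb


def site_likelihood_full(gls):
    """Full DP: work grows quadratically with the number of individuals.

    gls: one (L0, L1, L2) tuple per diploid individual, the likelihood of
    its reads given 0, 1 or 2 derived alleles.
    Returns a list whose entry k is the likelihood of derived-allele count k.
    """
    n = len(gls)
    z = [1.0]
    for i, gl in enumerate(gls, start=1):
        new = [0.0] * (2 * i + 1)
        for k, v in enumerate(z):
            new[k] += gl[0] * v
            new[k + 1] += 2 * gl[1] * v  # two ways to be heterozygous
            new[k + 2] += gl[2] * v
        z = new
    return [z[k] / comb(2 * n, k) for k in range(2 * n + 1)]


def site_likelihood_banded(gls, K):
    """Banded DP: keep only cells within K of the running best-guess count.

    Returns a dict mapping each retained allele count k to its likelihood.
    """
    n = len(gls)
    z = {0: 1.0}
    center = 0
    for i, gl in enumerate(gls, start=1):
        # Best-guess count so far: sum of the most likely genotypes.
        center += max(range(3), key=lambda g: gl[g])
        lo, hi = max(0, center - K), min(2 * i, center + K)
        new = {}
        for k, v in z.items():
            for g, w in ((0, gl[0]), (1, 2 * gl[1]), (2, gl[2])):
                if lo <= k + g <= hi:
                    new[k + g] = new.get(k + g, 0.0) + w * v
        z = new
    return {k: v / comb(2 * n, k) for k, v in z.items()}
```

Each step of the banded version touches at most 2K + 1 cells, so the total work is proportional to the number of individuals times K rather than to the square of the number of individuals. Cells outside the band are treated as zero, which is only reasonable when the likelihood mass really is concentrated near the best-guess count.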
Finally, as an application, I analyze high-coverage sequencing data from two dogs and three wolves to detect genetic signatures of adaptation during early dog domestication. This work is part of a larger research effort, called the Canid Genome Project, in which I take the lead in the selection scans. We identify the importance of dietary evolution in early dog domestication, supported by our top selection hit, the gene CCRN4L. Moreover, we observe that genes affecting brain function, metabolism, and morphology show signatures of selection in the dog lineage.