eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Alignment-free genomic distance estimation: from conventional methods to machine learning

No data is associated with this publication.
Abstract

Inferring evolutionary relationships from comparative analysis of genomic data remains a fundamental problem in biology. Conventionally, these analyses involve cumbersome and computationally expensive steps such as assembly, gene annotation, and multiple sequence alignment. Alternatively, phylogenomic analyses can be conducted using alignment-free approaches, which often use k-mers to compute evolutionary distances. Crucially, this approach can be used with low-coverage sequencing of samples (i.e., genome skims), which can reduce costs. However, despite being fast and accurate, k-mer-based methods have their own challenges.
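As an illustration of how k-mers can yield an alignment-free distance, the sketch below compares two sequences by the Jaccard index of their k-mer sets and converts it to a distance with the widely used Mash-style formula. This is a generic technique chosen for illustration, not necessarily the estimator used in this dissertation; the function names are hypothetical, and the sketch ignores reverse complements, sequencing error, and low coverage.

```python
import math

def kmer_set(seq, k):
    """Return the set of k-mers in seq, skipping any containing non-ACGT symbols."""
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }

def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mash_style_distance(j, k):
    """Mash-style distance from a Jaccard index j: -(1/k) * ln(2j / (1 + j)).

    Assumption: a simple mutation model; returns infinity when no k-mers are shared.
    """
    if j <= 0:
        return float("inf")
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

For two assembled sequences `s1` and `s2`, `mash_style_distance(jaccard(kmer_set(s1, 21), kmer_set(s2, 21)), 21)` gives a rough distance; genome skims would need additional correction for coverage and error, which this sketch does not attempt.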

A major challenge in analyzing genome skims is the presence of extraneous sequences in genomic data. We show that contaminants reoccurring in multiple samples can impact k-mer-based distance estimation and thus phylogenetic inference. To combat this problem, we introduce CONSULT, an algorithm for efficiently removing extraneous reads from sequencing samples. We demonstrate that CONSULT has higher accuracy for contamination detection than leading methods such as Kraken-II and improves distance calculation for genome skims. Additionally, we show that CONSULT can be used to distinguish organelle reads from nuclear reads, improving the quality of skims-based mitochondrial assemblies.

Another challenge in using k-mer-based phylogenetic methods is the absence of a solid statistical procedure to estimate uncertainty, limiting the use of these methods in practice. To address this problem, we developed an algorithm for quantifying the uncertainty of alignment-free phylogenies using subsampling and relying on sound statistical principles. We demonstrate that our method is reasonably fast and can correctly identify uncertain branches on phylogenies constructed using real and simulated datasets.

As a final challenge, we tackle the problem of updating phylogenies with new genomes while avoiding alignment or even assembly. As sequencing data becomes readily available, de novo tree reconstruction becomes infeasible. However, placement into an existing tree provides an efficient alternative. Existing attempts at alignment-free phylogenetic placement have both scalability and accuracy limitations. We approach this problem by representing each genome as a vector of k-mer frequencies and leveraging machine learning to estimate distances between such vectors. We demonstrate that our method, kf2d, outperforms existing k-mer-based approaches in distance calculation and allows placing new samples on phylogenies constructed from heterogeneous data types.
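To make the vector representation concrete, here is a minimal sketch of turning a sequence into a normalized k-mer frequency vector. It is not the kf2d implementation: the function name, the lexicographic ordering of k-mers, and the choice to skip k-mers with ambiguous bases are assumptions made for illustration, and the learned distance model itself is not sketched.

```python
from itertools import product

def kmer_frequency_vector(seq, k):
    """Return relative frequencies of all 4**k DNA k-mers, in lexicographic order.

    Assumptions: k-mers containing symbols other than A, C, G, T are skipped,
    and reverse complements are not merged (no canonical k-mers).
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    # Normalize so vectors from inputs of different sizes are comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

A small k keeps the vector dense and of fixed length (4**k entries), which suits it as input to a regression or neural model; the value of k and the model architecture are left open here.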


This item is under embargo until October 2, 2024.