Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Use solid k-mers in minHash-based genome distance estimation


MinHash is a popular method for genome distance estimation. However, it is sensitive to input data quality: its performance deteriorates when the input sequences come from sequencers with high error rates, especially long-read sequencers. To address this problem, this thesis uses solid (frequently occurring) k-mers as the input to MinHash and demonstrates the effectiveness of this solid k-mer powered MinHash by comparing its performance in genome distance estimation with that of regular MinHash. We also discuss how to select the optimal threshold for solid k-mers in order to make the most of the solid k-mer powered MinHash.
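The idea in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that a k-mer is "solid" when it occurs at least `min_count` times across the reads, that a bottom-s sketch is used for MinHash, and that the Jaccard estimate is converted to a distance with the Mash formula D = -ln(2j / (1 + j)) / k. All function names, the parameter `min_count`, the hash choice, and the toy reads are hypothetical and chosen only for illustration.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def solid_kmers(reads, k, min_count):
    """Return the k-mers seen at least min_count times across all reads.

    Assumption: k-mers created by sequencing errors are rare, so a count
    threshold removes most of them before sketching.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmer_set, size):
    """Bottom-s MinHash sketch: the `size` smallest hash values."""
    return sorted(hash64(km) for km in kmer_set)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j (assumed formula)."""
    if j <= 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    k, size = 5, 8
    genome = "ACGTACGGTCAATGCCGATT"
    noisy = "ACGTACGGTCGATGCCGATT"  # one substitution error at position 10
    reads = [genome, genome, noisy]

    reference = bottom_sketch(set(kmers(genome, k)), size)
    raw = bottom_sketch(solid_kmers(reads, k, min_count=1), size)
    solid = bottom_sketch(solid_kmers(reads, k, min_count=2), size)

    for name, sketch in (("all k-mers", raw), ("solid k-mers", solid)):
        j = jaccard_estimate(reference, sketch, size)
        print(name, "Jaccard:", j, "distance:", mash_distance(j, k))
```

With `min_count=1` every k-mer is kept, which corresponds to regular MinHash on the raw reads. Raising the threshold discards rare k-mers, at the risk of also discarding genuine k-mers when coverage is low, which is why the choice of threshold matters.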
