Use solid k-mers in MinHash-based genome distance estimation
MinHash is a popular method for genome distance estimation. However, it is sensitive to input data quality: its accuracy deteriorates when the input sequences come from sequencers with high error rates, long-read sequencers in particular. To address this problem, this thesis feeds MinHash with solid (frequently occurring) k-mers rather than all k-mers, and demonstrates the effectiveness of this solid k-mer powered MinHash by comparing its genome distance estimates with those of regular MinHash. We also discuss how to select the optimal threshold for solid k-mers in order to make the most of solid k-mer powered MinHash.
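The idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this thesis. It counts k-mers across reads, keeps only those seen at least `min_count` times (the solid k-mers), and builds a bottom-s MinHash sketch from them. All names here (`solid_kmers`, `sketch`, `jaccard_estimate`, `mash_distance`, `min_count`) are hypothetical. The hash function (BLAKE2b truncated to 64 bits) and the Mash-style conversion from Jaccard index to distance are assumptions made for illustration. Reverse-complement canonicalization is omitted for brevity.

```python
import hashlib
from collections import Counter
from math import log


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def solid_kmers(reads, k, min_count):
    """Return the set of k-mers occurring at least min_count times in reads.

    min_count is the solidity threshold; k-mers created by sequencing
    errors tend to be rare and fall below it.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (assumed hash: truncated BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmer_set, s):
    """Bottom-s sketch: the s smallest hash values of the k-mer set."""
    return sorted(hash64(km) for km in kmer_set)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = a & b
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate (assumed formula)."""
    if j == 0:
        return 1.0
    return max(0.0, -log(2 * j / (1 + j)) / k)
```

Regular MinHash corresponds to `min_count=1`, where every observed k-mer, including erroneous ones, enters the sketch. Raising `min_count` trades sensitivity to low-coverage regions for robustness to sequencing errors, which is why the choice of threshold matters.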