Xu, Weihong

High-performance Software and Hardware Designs for Genomics and Proteomics

2024

Advisor(s): Rosing, Tajana Šimunić

Abstract

Genomics and proteomics are at the forefront of innovations in precision medicine and drug discovery. However, the rapid data expansion in these fields presents significant computational challenges, emphasizing the need for more efficient algorithm and hardware designs. Current research overlooks systematic acceleration that spans both software and hardware. This dissertation bridges this gap by presenting high-performance designs that enhance the efficiency, accuracy, and scalability of data analysis in genomics and proteomics.

Genome alignment is crucial for evaluating sequence similarity in genomics, but existing solutions are hindered by high memory footprints and computational complexity. To address these challenges, this thesis introduces RAPIDx, an algorithm and hardware co-design that enhances the efficiency and throughput of genome alignment. RAPIDx leverages Processing-in-Memory (PIM) techniques for in-situ computation, significantly boosting energy efficiency. It also employs an adaptive banded alignment algorithm tailored for ReRAM-based PIM architectures, reducing computational complexity and memory requirements while maintaining high accuracy. The proposed PIM architecture achieves up to 131.1× and 46.8× higher throughput than state-of-the-art CPU and GPU implementations, respectively.
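To illustrate why banding reduces both computation and memory, the following is a minimal sketch of a fixed-band edit-distance computation. It is not RAPIDx's adaptive algorithm or its PIM mapping: the function name `banded_edit_distance`, the fixed `band` parameter, and the unit-cost scoring are assumptions chosen for illustration. Only cells within `band` diagonals of the main diagonal are evaluated, so the work drops from O(nm) to roughly O(n · band).

```python
from typing import Optional

def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance restricted to cells with |i - j| <= band.

    Returns None when the end cell lies outside the band.
    Hypothetical helper for illustration; unit costs assumed.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: only the in-band columns are stored.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # deleting the first i characters of a
                continue
            # Out-of-band neighbours are treated as unreachable (inf).
            sub = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            dele = prev.get(j, inf) + 1
            ins = cur.get(j - 1, inf) + 1
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[m]

print(banded_edit_distance("kitten", "sitting", band=2))
```

A fixed band returns the exact distance only when an optimal alignment stays inside the band; an adaptive scheme, as described in the abstract, adjusts the band instead of fixing it in advance.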

RAPIDx delivers high accuracy across various genome analysis tasks, but its substantial memory consumption makes it unsuitable for latency-sensitive scenarios or resource-constrained hardware. To address these limitations, this thesis proposes HyperGen, a memory-efficient genome sketching tool that eliminates the need for costly alignment. HyperGen leverages hyperdimensional computing (HDC) to significantly improve runtime performance, memory efficiency, and accuracy in large-scale genomic analyses, enabling rapid and precise Average Nucleotide Identity (ANI) estimation. The tool demonstrates superior performance in both genome sketching and database search tasks.
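The general idea of alignment-free comparison with HDC can be sketched as follows. This is a simplified illustration, not HyperGen's actual pipeline: the dimension `D`, the k-mer length, the SHA-256 seeding, and the helper names `kmer_hv`, `sketch`, and `cosine` are all assumptions, and the code stops at a similarity score rather than deriving an ANI estimate. Each distinct k-mer is mapped to a pseudo-random ±1 hypervector, the vectors are summed into one fixed-size sketch per genome, and two sketches are compared by cosine similarity.

```python
import hashlib
import random
from math import sqrt
from typing import List

D = 1024  # hypervector dimension (illustrative value)

def kmer_hv(kmer: str) -> List[int]:
    """Deterministic pseudo-random +/-1 hypervector for one k-mer."""
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    bits = random.Random(seed).getrandbits(D)
    return [1 if (bits >> i) & 1 else -1 for i in range(D)]

def sketch(seq: str, k: int = 8) -> List[int]:
    """Sum the hypervectors of all distinct k-mers (assumes len(seq) >= k)."""
    acc = [0] * D
    for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
        for i, v in enumerate(kmer_hv(kmer)):
            acc[i] += v
    return acc

def cosine(u: List[int], v: List[int]) -> float:
    """Cosine similarity between two sketches."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(200))
mutated = ref[:100] + ("A" if ref[100] != "A" else "C") + ref[101:]
unrelated = "".join(rng.choice("ACGT") for _ in range(200))
print(cosine(sketch(ref), sketch(mutated)), cosine(sketch(ref), sketch(unrelated)))
```

The sketch size is fixed at `D` integers regardless of genome length, which is the property that makes this style of comparison memory-efficient; sequences sharing most k-mers score near 1, while unrelated sequences score near 0.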

Proteomics, which uses mass spectrometry (MS) to analyze proteins, provides deep insights into cellular functions and disease mechanisms. MS clustering is crucial for organizing and interpreting these datasets, enabling more efficient identification of proteins and peptides. However, the demand for accurate, fast, and scalable algorithms presents a significant challenge for large-scale analyses. To address this, this thesis introduces HyperSpec, a high-performance tool that accelerates spectral clustering by leveraging the lightweight, parallelizable nature of HDC. HyperSpec reduces clustering runtime while maintaining high clustering quality, cutting the processing time of 21 million spectra from 4 hours to just 24 minutes.

Although HyperSpec significantly speeds up MS clustering, our profiling analysis reveals that MS data preprocessing remains the primary bottleneck because of the inefficient data path of conventional von Neumann architectures. To overcome this, a near-storage accelerator, MSAS, is presented to speed up MS data preprocessing. By processing spectra close to the storage medium, MSAS minimizes costly data movement between storage and computation units. Its channel-level design achieves up to 187× speedup compared to CPU-based preprocessing and outperforms existing in-storage computing solutions. When integrated into existing MS clustering tools, MSAS enhances overall system performance, yielding 3.5× to 9.8× improvements in speed and 2.8× to 11.9× gains in energy efficiency.


UC San Diego
