With the advancement of processor technology, numerous hardware accelerators beyond CPUs and GPUs are emerging to meet rapidly growing computation demands. In particular, the demand from AI and ML applications is outpacing improvements in general-purpose hardware, prompting researchers to integrate hardware accelerators into architectural designs. This raises a fundamental research question: Are we fully exploiting these AI/ML hardware accelerators?
This dissertation addresses this question from three perspectives. First, are we using this hardware efficiently and effectively? Optimal performance requires that the system supply data smoothly to powerful computing units. Second, how portable are hardware accelerators? Although they are designed for compute-intensive AI/ML workloads, can other domains benefit from them? Finally, do we need more accelerators, or are the current ones sufficient for evolving AI-assisted applications?
To answer the first question, I proposed Varifocal Storage (VS), an architecture that reduces unnecessary data through in-storage processing, mitigating data traffic on interconnects and minimizing data transformation overhead. By dynamically adjusting data resolution, VS balances the competing demands of performance, flexibility, cost, and quality through a hardware/software co-design within the approximate computing framework.
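The core idea of adjusting data resolution inside the storage device can be illustrated with a minimal sketch. The function below is hypothetical (not VS's actual interface): it downsamples a 2D array by block averaging before the data leaves the drive, so only the reduced-resolution version crosses the interconnect.

```python
import numpy as np

def adjust_resolution(data: np.ndarray, factor: int) -> np.ndarray:
    """Hypothetical in-storage kernel: downsample a 2D array by
    averaging non-overlapping factor x factor blocks, so the host
    receives 1/factor^2 of the original data volume."""
    h, w = data.shape
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    blocks = data[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

full = np.arange(16, dtype=np.float32).reshape(4, 4)
half = adjust_resolution(full, 2)  # 2x2 result: 4x less data transferred
```

An application that tolerates approximation requests a coarser resolution and pays less in transfer and transformation cost; one that needs exact data simply requests factor 1.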
For the second question, I proposed TCUDB, a relational database query engine that leverages Tensor Cores to significantly accelerate SQL query processing, achieving orders-of-magnitude speedups even for non-AI/ML queries. TCUDB revisits application algorithms and data layouts for emerging hardware accelerators, demonstrating versatility across a range of analytic queries and use cases, including matrix multiplication, entity matching, and graph applications.
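To see how a relational operation can map onto matrix hardware at all, consider equi-join cardinality: the number of matching row pairs is the inner product of the two tables' key histograms. The sketch below uses plain NumPy as a stand-in for the Tensor Core units and is only an illustration of the reduction, not TCUDB's implementation.

```python
import numpy as np

def join_cardinality(r_keys, s_keys, n_keys):
    """|R JOIN S on key| as an inner product of key histograms --
    the kind of reduction a matrix engine evaluates natively
    (NumPy stands in for the Tensor Cores here)."""
    hr = np.bincount(r_keys, minlength=n_keys).astype(np.float32)
    hs = np.bincount(s_keys, minlength=n_keys).astype(np.float32)
    return int(hr @ hs)  # per-key match counts multiply, then sum

# R has keys [0, 0, 1, 2]; S has keys [0, 2, 2].
# Key 0 contributes 2*1 rows, key 2 contributes 1*2, total 4.
n = join_cardinality([0, 0, 1, 2], [0, 2, 2], n_keys=3)
```

Grouped aggregations and graph traversals (adjacency-matrix products) follow the same pattern, which is why such non-AI/ML queries can still benefit from matrix accelerators.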
Finally, driven by the growth of AI-based personal assistant applications and the shift from traditional PCs to mobile devices, I proposed the Personal Assistant Multi-device Machine Learning Benchmark (PAMLB) to address the complexity of their data processing pipelines. Existing benchmarks such as Rodinia and TPC-H fail to capture the real-world behavior of AI-assisted applications, which rely heavily on interactions between small user devices and data centers. PAMLB provides comprehensive workloads for optimizing and deploying pipeline modules across different devices for these advanced applications.