With the advancement of processor technology, numerous hardware accelerators beyond CPUs and GPUs are emerging to meet rapidly growing computation demands. In particular, the demand from AI and ML applications is outpacing improvements in general-purpose hardware, prompting researchers to integrate hardware accelerators into architectural designs. This raises a fundamental research question: Are we fully exploiting these AI/ML hardware accelerators?
This dissertation addresses this question from three perspectives. First, are we using this hardware efficiently and effectively? Optimal performance requires that the system supply data smoothly to powerful computing units. Second, how portable are hardware accelerators? Although they are designed for compute-intensive AI/ML workloads, can other domains benefit from them? Finally, do we need more accelerators, or are the current ones sufficient for evolving AI-assisted applications?
To answer the first question, I proposed Varifocal Storage (VS), an architecture that reduces unnecessary data through in-storage processing, mitigating data traffic on interconnects and minimizing data transformation overhead. By dynamically adjusting data resolution, VS balances the competing demands of performance, flexibility, cost, and quality through a hardware/software co-design within the approximate computing framework.
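The core idea of adjusting data resolution inside the storage device can be illustrated with a minimal sketch. The function below is hypothetical (not VS's actual interface): it downsamples a 2D array by block averaging before the data leaves the drive, so only the reduced-resolution version crosses the interconnect.

```python
import numpy as np

def adjust_resolution(data: np.ndarray, factor: int) -> np.ndarray:
    """Hypothetical in-storage kernel: downsample a 2D array by
    averaging non-overlapping factor x factor blocks, so the host
    receives 1/factor^2 of the original data volume."""
    h, w = data.shape
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    blocks = data[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

full = np.arange(16, dtype=np.float32).reshape(4, 4)
half = adjust_resolution(full, 2)  # 2x2 result: 4x less data transferred
```

An application that tolerates approximation requests a coarser resolution and pays less in transfer and transformation cost; one that needs exact data simply requests factor 1.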
For the second question, I proposed TCUDB, a relational database query engine that leverages Tensor Cores to significantly accelerate SQL query processing, achieving orders-of-magnitude speedups even for non-AI/ML queries. TCUDB revisits application algorithms and data layouts for emerging hardware accelerators, demonstrating versatility across a range of analytic queries and use cases, including matrix multiplication, entity matching, and graph applications.
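To see how a relational operation can map onto matrix hardware at all, consider equi-join cardinality: the number of matching row pairs is the inner product of the two tables' key histograms. The sketch below uses plain NumPy as a stand-in for the Tensor Core units and is only an illustration of the reduction, not TCUDB's implementation.

```python
import numpy as np

def join_cardinality(r_keys, s_keys, n_keys):
    """|R JOIN S on key| as an inner product of key histograms --
    the kind of reduction a matrix engine evaluates natively
    (NumPy stands in for the Tensor Cores here)."""
    hr = np.bincount(r_keys, minlength=n_keys).astype(np.float32)
    hs = np.bincount(s_keys, minlength=n_keys).astype(np.float32)
    return int(hr @ hs)  # per-key match counts multiply, then sum

# R has keys [0, 0, 1, 2]; S has keys [0, 2, 2].
# Key 0 contributes 2*1 rows, key 2 contributes 1*2, total 4.
n = join_cardinality([0, 0, 1, 2], [0, 2, 2], n_keys=3)
```

Grouped aggregations and graph traversals (adjacency-matrix products) follow the same pattern, which is why such non-AI/ML queries can still benefit from matrix accelerators.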
Finally, driven by the growth of AI-based personal assistant applications and the shift from traditional PCs to mobile devices, I proposed the Personal Assistant Multi-device Machine Learning Benchmark (PAMLB) to address the complexity of their data processing pipelines. Existing benchmarks such as Rodinia and TPC-H fail to capture the real-world behavior of AI-assisted applications, which rely heavily on interactions between small user devices and data centers. PAMLB provides comprehensive workloads for optimizing and deploying pipeline modules across different devices for these advanced applications.