Dynamically Determined Output Space For SpGEMM Using GPU Virtual Memory Management
eScholarship
Open Access Publications from the University of California

UC Davis Electronic Theses and Dissertations

Abstract

Sparse general matrix-matrix multiplication (SpGEMM) is a costly yet fundamental operation used across numerous scientific computing domains, such as machine learning, graph algorithms, and computational fluid dynamics. As GPUs are increasingly used to accelerate scientific computing applications, optimizing SpGEMM for GPUs is a critical research area. One of the challenges in implementing SpGEMM on GPUs is determining the size of the output matrix, since the number of nonzeros in the output depends on how the sparsity patterns of the input matrices interact and is not known beforehand. Precisely determining the size of the output matrix can add to the latency of the SpGEMM kernel, especially for large graphs.
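To see why the output size is data dependent, consider a toy row-wise SpGEMM in pure Python. This is an illustrative sketch, not the thesis implementation: the matrices and helper names are made up. The common upper-bound estimate counts every partial product as a distinct output entry, while the exact count is only known after column indices collide during the actual multiplication.

```python
# Toy row-wise SpGEMM over dict-of-dicts (CSR-like) matrices.

def spgemm(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j], computed one output row at a time."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a * b  # colliding columns merge here
        if acc:
            C[i] = acc
    return C

def upper_bound_nnz(A, B):
    """Classic upper bound: assume no two partial products share a column."""
    return sum(len(B.get(k, {})) for row in A.values() for k in row)

A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 1.0}, 1: {0: 2.0, 1: 1.0}}

C = spgemm(A, B)
exact = sum(len(row) for row in C.values())
print(exact, upper_bound_nnz(A, B))  # 4 5: the bound overestimates
```

Already on this 2x2 example the upper bound (5) exceeds the true output size (4); on large real-world graphs the gap, and hence the wasted allocation, can be far larger.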

This work presents an algorithm that dynamically expands the size of the output matrix as the SpGEMM kernel proceeds with the computation. This technique can improve memory utilization efficiency by avoiding unnecessary memory allocation, at a modest performance cost. However, implementing a dynamic data structure such as a vector to store the output matrix with the default cudaMalloc-based allocator is not memory efficient: growing the vector requires allocating more memory than necessary, and the cudaMemcpy() calls that copy data from the old allocation to the new one add latency. Hence, this work also presents a custom virtual memory-based allocator. Virtual memory-based allocations are better suited to growing data structures because new physical allocations can be created and mapped into a contiguous virtual address range with ease.
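The trade-off between the two growth strategies can be sketched with a small CPU-side cost model (plain Python, not real CUDA code, and not the thesis allocator). The "realloc" class mimics cudaMalloc-plus-cudaMemcpy doubling; the "VMM" class mimics reserving a large virtual range once and mapping fixed-size physical chunks on demand, as the CUDA driver's virtual memory management API (cuMemAddressReserve, cuMemCreate, cuMemMap) permits. The doubling policy and the chunk granularity are assumptions chosen for illustration; all sizes are element counts.

```python
# Cost model for two ways a device-side vector can grow.

class ReallocVector:
    """Mimics cudaMalloc + cudaMemcpy growth with capacity doubling."""
    def __init__(self):
        self.size = 0
        self.capacity = 1
        self.copied = 0                      # elements copied across regrows
        self.peak_physical = self.capacity   # worst-case live allocation

    def append_n(self, n):
        for _ in range(n):
            if self.size == self.capacity:
                new_cap = self.capacity * 2
                # old and new buffers coexist while the copy is in flight
                self.peak_physical = max(self.peak_physical,
                                         self.capacity + new_cap)
                self.copied += self.size
                self.capacity = new_cap
            self.size += 1

class VMMVector:
    """Mimics a reserved virtual range with physical chunks mapped on demand."""
    def __init__(self, chunk=4096):
        self.size = 0
        self.chunk = chunk   # physical mapping granularity (assumed value)
        self.mapped = 0      # physical elements currently mapped
        self.copied = 0      # never copies: the virtual range stays contiguous

    def append_n(self, n):
        self.size += n
        while self.mapped < self.size:
            self.mapped += self.chunk   # map one more physical chunk

rv, vv = ReallocVector(), VMMVector()
rv.append_n(100_000)
vv.append_n(100_000)
print(rv.copied, rv.peak_physical)  # 131071 196608
print(vv.copied, vv.mapped)         # 0 102400
```

Under this model the doubling allocator copies more elements than it ever stores and transiently holds nearly 2x the final capacity, while the virtual memory approach copies nothing and over-maps by at most one chunk, which is the intuition behind the custom allocator described above.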

Our experiments show that the dynamic SpGEMM algorithm proposed in this work achieves 70-99% memory efficiency at a modest performance cost compared to the SpGEMM algorithm that uses the upper-bound estimation method. The proposed algorithm also consistently achieves better performance, for a modest loss in memory efficiency, compared to the SpGEMM algorithm that uses the precise estimation method. We also demonstrate that our custom virtual memory-based allocator can improve the memory efficiency and reduce the overhead latency of dynamic data structures on GPUs.
