Open Access Publications from the University of California

Register Packing for Cyclic Reduction: A Case Study


We generalize a method for avoiding GPU shared-memory communication in algorithms with a downsweep pattern, and we apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared to other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and step-efficiency issues. We address this problem by applying our methodology for reducing shared-memory communication in the downsweep. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. With our generalized mapping, we improve Cyclic Reduction's performance on a GPU by a factor of 3 to 4.5x over the original CR implementation, making it 1.5 to 3x faster than other GPU tridiagonal solvers.
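
For readers unfamiliar with Cyclic Reduction, the following is a minimal serial sketch in Python of the textbook algorithm. It is not the register-packed GPU mapping summarized above. The function name `cyclic_reduction`, the array layout, and the restriction to systems of size 2^k - 1 that need no pivoting are assumptions made for illustration. The second loop, which fans solved values back out at halving strides, is presumably the phase with the downsweep pattern that the abstract refers to.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic reduction.

    a: sub-diagonal (a[0] is ignored), b: main diagonal,
    c: super-diagonal (c[-1] is ignored), d: right-hand side.
    Illustrative assumptions: len(b) == 2**k - 1 and no pivoting is
    needed (for example, the matrix is diagonally dominant).
    """
    n = len(b)
    if n == 0 or (n & (n + 1)) != 0:
        raise ValueError("this sketch requires n = 2**k - 1")

    # Work on copies so the caller's arrays are left untouched.
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[-1] = 0.0

    # Forward reduction: at stride s, eliminate the neighbours at
    # distance s from every equation whose index is 2s-1 (mod 2s).
    s = 1
    while 2 * s - 1 < n:
        for i in range(2 * s - 1, n, 2 * s):
            lo, hi = i - s, i + s
            k1 = a[i] / b[lo]
            k2 = c[i] / b[hi]
            b[i] -= c[lo] * k1 + a[hi] * k2
            d[i] -= d[lo] * k1 + d[hi] * k2
            a[i] = -a[lo] * k1
            c[i] = -c[hi] * k2
        s *= 2

    # Backward substitution: solve the single remaining equation, then
    # fill in the unknowns skipped at each level, halving the stride.
    x = [0.0] * n
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            left = x[i - s] if i - s >= 0 else 0.0
            right = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        s //= 2
    return x
```

Each level of both loops touches only elements a fixed stride apart, which is why a parallel version can assign one equation per thread at each level and why the way those elements are exchanged between threads dominates the cost.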
