Register Packing for Cyclic Reduction: A Case Study
Published Web Locationhttps://doi.org/10.1145/1964179.1964185
We generalize a method for avoiding GPU shared communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction, a tridiagonal solver with this pattern. Previously, Cyclic Reduction suffered poor performance when compared to other tridiagonal solvers on the GPU due to performance issues stemming from shared memory bandwidth bottlenecks and step-efficiency. We address this problem by applying our downsweep shared-memory communication reducing methodology. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction's performance on a GPU by a factor of 3--4.5x over the original CR implementation, making it 1.5--3x faster than other GPU tridiagonal solvers.