- Author(s): Ashkiani, Saman
- Davidson, Andrew
- Meyer, Ulrich
- Owens, John D
- et al.
Published Web Location: https://doi.org/10.1145/2851141.2851169
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0–6.7x (4.4–8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
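To make the definition concrete, the following is a minimal sequential sketch of what a multisplit computes. It is not the GPU implementation described in the abstract. It assumes a common counting approach (per-bucket histogram, exclusive prefix sum, stable scatter), and the names `multisplit`, `bucket_of`, and `num_buckets` are hypothetical, chosen only for illustration.

```python
from itertools import accumulate
from typing import Callable, List, Sequence, Tuple


def multisplit(keys: Sequence[int],
               bucket_of: Callable[[int], int],
               num_buckets: int) -> Tuple[List[int], List[int]]:
    """Sequential reference sketch (assumed approach, not the paper's GPU code).

    Returns the keys permuted into contiguous buckets, plus the starting
    offset of each bucket in the output.
    """
    # Step 1: histogram, counting how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1

    # Step 2: exclusive prefix sum of the counts gives each bucket's start.
    offsets = list(accumulate(counts, initial=0))[:-1]

    # Step 3: scatter each key to the next free slot of its bucket.
    # Visiting keys in input order keeps the result stable within a bucket.
    cursor = list(offsets)
    out = [0] * len(keys)
    for k in keys:
        b = bucket_of(k)
        out[cursor[b]] = k
        cursor[b] += 1
    return out, offsets


if __name__ == "__main__":
    # Hypothetical bucket function: three buckets keyed by remainder mod 3.
    result, starts = multisplit([7, 2, 9, 4, 1, 8], lambda k: k % 3, 3)
    print(result, starts)
```

Within each bucket the keys keep their input order rather than being sorted, which illustrates why a full sort does more work than a multisplit requires.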