Graphics Processing Units (GPUs) are massively parallel processors with thousands of active threads originally designed for throughput-oriented tasks.In order to get as much performance as possible given the hardware characteristics of GPUs, it is extremely important for programmers to not only design an efficient algorithm with good enough asymptotic complexities, but also to take into account the hardware limitations and preferences.In this work, we focus our design on two high level abstractions: work assignment and processing. The former denotes the assigned task by the programmer to each thread or group of threads. The latter encapsulates the actual execution of assigned tasks.
Previous work conflates work assignment and processing into similar granularities. The most traditional way is to have per-thread work assignment followed by per-thread processing of that assigned work. Each thread sequentially processes a part of input and then the results are combined appropriately. In this work, we use this approach in implementing various algorithms for the string matching problem (finding all instances of a pattern within a larger text).Another effective but less popular idea is per-warp work assignment followed by per-warp processing of that work. It usually requires efficient intra-warp communication to be able to efficiently process input data which is now distributed among all threads within that warp. With the emergence of warp-wide voting and shuffle instructions, this approach has gained more potential in solving particular problems efficiently and with some benefits compared to the per-thread assignment and processing. In this work, we use this approach to implement a series of parallel algorithms: histogram, multisplit and radix sort.
An advantage of using similar granularities for work assignment and processing is in problems with uniform per-thread or per-warp workloads, where it is quite easy to adapt warp-synchronous ideas and achieve high performance.However, with non-uniform irregular workloads, different threads might finish their processing in different times which can cause a sub-par performance. This is mainly because the whole warp continues to be resident in the device as long as all its threads are finished.With these irregular tasks in mind, we propose to use different granularities for our work assignment and processing.We use a per-thread work assignment followed by a per-warp processing; each thread is still responsible for an independent task, but now all threads within a warp cooperate with each other to perform all these tasks together, one at a time, until all are successfully processed.Using this strategy, we design a dynamic hash table for the GPU, the slab hash, which is a totally concurrent data structure supporting asynchronous updates and search queries: threads may have different operations to perform and each might require an unknown amount of time to be fulfilled. By following our warp-cooperative strategy, all threads help each other perform these operations together, causing a much higher warp efficiency compared to traditional conflated work assignment and processing schemes.