High Performance Vertex-Centric Graph Analytics on GPUs
- Author(s): Khorasani, Farzad
- Advisor(s): Gupta, Rajiv
- et al.
The massive parallel processing power of GPUs presents an attractive opportunity for accelerating large-scale vertex-centric graph computations. However, the inherent irregularity and large sizes of real-world power-law graphs create many challenges: lock-step execution by threads within a SIMD group restricts exploitable parallelism, the limited size of GPU DRAM restricts the sizes of graphs that can be offloaded to the GPU, and the limited inter-GPU communication bandwidth necessitates judicious use of the available bandwidth. This dissertation addresses all of these challenges.

We present Warp Segmentation, which greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process each vertex while employing the compact CSR representation to maximize the graph size that can be held in GPU global memory. Prior works can maximize either graph size (e.g., VWC) or device utilization (e.g., CuSha), but not both. We scale graph processing over multiple GPUs via Vertex Refinement, which dynamically collects and transfers only the updated boundary vertices, dramatically reducing the amount of inter-GPU data transfer; existing multi-GPU techniques (Medusa, TOTEM) perform a high degree of wasteful vertex transfers.

Since processing all vertices at every iteration wastes much of the GPU's computational power, we present a work-efficient solution that, during each iteration, processes only those vertices that were activated in the previous iteration. We employ an effective task expansion strategy that avoids intra-warp thread underutilization. For multi-GPU graph computation, we present permissive partitioning to dynamically balance load across GPUs. Also, since recording vertex activeness requires additional data structures, we introduce vertex grouping to manage the graph storage overhead, enabling a trade-off between memory consumption and work efficiency.
Finally, to apply the proposed solutions to other irregular applications, we generalize our techniques as Collaborative Context Collection (CCC) and Collaborative Task Engagement (CTE). CCC is a software/compiler technique that enhances SIMD efficiency in loops containing thread divergence. CTE hides the complexities of this otherwise intricate technique behind a CUDA C++ device-side template library and balances load across the threads within a SIMD group.