GPGPU parallel algorithms for structured-grid CFD codes
Published Web Location: https://doi.org/10.2514/6.2011-3221
A new high-performance general-purpose graphics processing unit (GPGPU) computational fluid dynamics (CFD) library is introduced for use with structured-grid CFD algorithms. A novel set of parallel tridiagonal matrix solvers, implemented in CUDA, is included. The solver library supports both scalar and block-tridiagonal matrices suitable for approximate factorization (AF) schemes. The computational routines are designed for use either in GPU-based CFD codes or as a GPU accelerator for CPU-based algorithms. Additionally, the library includes, among others, a collection of finite-volume calculation routines for computing local and global stable time-steps, inviscid surface fluxes, and face/node/cell-centered interpolation on generalized 3D, multi-block structured grids. GPU block-tridiagonal benchmarks showed a speed-up of 3.6x compared to OpenMP CPU Thomas algorithm results when host-device data transfers are excluded. Detailed analysis shows that a structure-of-arrays (SOA) matrix storage format, rather than an array-of-structures (AOS) format, improved the parallel block-tridiagonal performance on the GPU by a factor of 2.6x for the parallel cyclic reduction (PCR) algorithm. The GPU block-tridiagonal solver was also applied to the OVERFLOW-2 CFD code. Performance measurements using synchronous and asynchronous data transfers within the OVERFLOW-2 code showed poorer performance compared to the cache-optimized CPU Thomas algorithm. The poor performance was attributed to the significant cost of the rank-5 sub-matrix and sub-vector host-device data transfers and the matrix format conversion. The finite-volume maximum time-step and inviscid flux kernels were benchmarked within the MBFLO3 CFD code and showed speed-ups, including the cost of host-device memory transfers, ranging from 3.2x to 4.3x compared to optimized CPU code.
It was determined, however, that GPU acceleration could be increased to 21x over a single CPU core if host-device data transfers could be eliminated or significantly reduced.
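
For readers unfamiliar with the two solver families compared above, the following is a minimal sketch of the serial Thomas algorithm and of parallel cyclic reduction (PCR) for a scalar tridiagonal system. It is an illustration only, not the library's CUDA implementation: the function names `thomas_solve` and `pcr_solve` are hypothetical, the code runs on the CPU in plain Python, it handles scalar rather than block (rank-5) entries, and it assumes a well-conditioned (for example diagonally dominant) system so that no pivoting is needed.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's API.
# System: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] and c[n-1] unused.

def thomas_solve(a, b, c, d):
    """Serial Thomas algorithm: forward elimination, then back substitution."""
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr_solve(a, b, c, d):
    """Parallel cyclic reduction, written as a serial loop over equations.

    At each step every equation i eliminates its coupling to i-s and i+s,
    so the coupling distance s doubles. After about log2(n) steps all
    equations are decoupled. On a GPU, the inner loop over i would map to
    one thread per equation; here it is an ordinary Python loop.
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0   # no neighbour to the left of the first equation
    c[-1] = 0.0  # no neighbour to the right of the last equation
    s = 1
    while s < n:
        # Write into fresh arrays so every update reads the previous step's values.
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):
            lo, hi = i - s, i + s
            if lo >= 0:
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        s *= 2
    return [d[i] / b[i] for i in range(n)]
```

The sketch shows the trade-off behind the benchmarks: the Thomas algorithm does O(n) work but is inherently sequential along a grid line, whereas PCR does O(n log n) work but every equation can be updated independently within a step. Keeping each coefficient in its own contiguous array, as done here with `a`, `b`, `c`, and `d`, corresponds to the structure-of-arrays layout discussed above.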