eScholarship
Open Access Publications from the University of California

College of Engineering

Electrical & Computer Engineering, UC Davis

The UC Davis College of Engineering comprises seven academic departments: Biological & Agricultural, Biomedical, Chemical and Materials Science, Civil and Environmental, Computer Science, Electrical and Computer, and Mechanical and Aerospace Engineering.

http://engineering.ucdavis.edu

Maximum Clique Enumeration on the GPU

(2023)

We present an iterative breadth-first approach to maximum clique enumeration on the GPU. The memory required to store all of the intermediate clique candidates poses a significant challenge. To mitigate this issue, we employ a variety of strategies to prune away non-maximum candidates and present a thorough examination of the performance and memory benefits of each of these options. We also explore a windowing strategy as a middle ground between breadth-first and depth-first approaches, and investigate the resulting tradeoff between parallel efficiency and memory usage. Our results demonstrate that when the memory requirements can be managed, our approach achieves high throughput on large graphs, indicating that it is a good fit for the GPU. We demonstrate an average speedup of 1.9x over previous parallel work and obtain our best performance on graphs with low average degree.
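
As a rough CPU-side sketch of the level-synchronous (breadth-first) expansion idea, assuming a simple adjacency-set representation, the following Python function grows all cliques one vertex per level and keeps the last non-empty level; the GPU data layout, the pruning strategies, and the windowing scheme described above are not reproduced here.

```python
def maximum_cliques_bfs(adj):
    """Level-synchronous (breadth-first) maximum clique enumeration.

    adj: dict mapping each vertex to the set of its neighbors.
    Returns the list of maximum cliques as ascending tuples.
    """
    # Level 1: every vertex is a clique of size one.
    frontier = [(v,) for v in sorted(adj)]
    best = frontier
    while frontier:
        next_frontier = []
        for clique in frontier:
            # Candidates adjacent to every member of the clique; extending
            # only with larger ids generates each clique exactly once.
            common = set.intersection(*(adj[v] for v in clique))
            for u in sorted(common):
                if u > clique[-1]:
                    next_frontier.append(clique + (u,))
        if next_frontier:
            best = next_frontier
        frontier = next_frontier
    return best


# Tiny example: two triangles sharing the edge (0, 2).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(maximum_cliques_bfs(adj))  # [(0, 1, 2), (0, 2, 3)]
```

Because an entire level of candidates is materialized at once, memory grows quickly with graph size, which is exactly what the pruning and windowing strategies above are meant to control.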

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

(2022)

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that not only the device active time (the sum of kernel runtimes) but also the device idle time is an important component of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction, respectively, with overheads from individual workloads. A slight increase of 2.19% in E2E prediction error when overheads are shared across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show that our system can provide more general model-system co-design than previous methods.
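
As a hedged illustration of the critical-path idea, assuming a DAG of operators with per-operator predicted kernel times and overhead terms (the paper's kernel models and five overhead categories are not reproduced), the Python sketch below computes the longest path through an execution graph as the per-batch time estimate; the operator names and timings in the example are invented.

```python
from collections import defaultdict

def predict_batch_time(ops, deps, kernel_time, overhead):
    """Hypothetical critical-path estimate of per-batch training time.

    ops:          operator ids forming a DAG (the execution graph).
    deps:         list of (u, v) pairs meaning u must finish before v starts.
    kernel_time:  dict op -> predicted device-active time, in ms.
    overhead:     dict op -> predicted launch/framework overhead, in ms.
    Returns the length of the longest (critical) path, in ms.
    """
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in deps:
        succ[u].append(v)
        indeg[v] += 1
    ready = [op for op in ops if indeg[op] == 0]   # Kahn's algorithm
    finish = {op: 0.0 for op in ops}               # earliest finish time per op
    while ready:
        op = ready.pop()
        finish[op] += kernel_time.get(op, 0.0) + overhead.get(op, 0.0)
        for nxt in succ[op]:
            finish[nxt] = max(finish[nxt], finish[op])  # wait for slowest predecessor
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return max(finish.values())


# Toy execution graph with invented operator names and timings (ms).
ops = ["embedding", "mlp_bottom", "interaction", "mlp_top"]
deps = [("embedding", "interaction"), ("mlp_bottom", "interaction"),
        ("interaction", "mlp_top")]
kernels = {"embedding": 0.8, "mlp_bottom": 0.5, "interaction": 0.3, "mlp_top": 0.6}
overheads = {op: 0.05 for op in ops}
print(predict_batch_time(ops, deps, kernels, overheads))  # ~1.85 ms critical path
```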

Neon: A Multi-GPU Programming Model for Grid-based Computations

(2022)

We present Neon, a new programming model for grid-based computation with an intuitive, easy-to-use interface that allows domain experts to take full advantage of single-node multi-GPU systems. Neon decouples data structure from computation and back end configurations, allowing the same user code to operate on a variety of data structures and devices. Neon relies on a set of hierarchical abstractions that allow the user to write their applications as if they were sequential applications, while the runtime handles distribution across multiple GPUs and performs optimizations such as overlapping computation and communication without user intervention. We evaluate our programming model on several applications: a Lattice Boltzmann fluid solver, a finite-difference Poisson solver and a finite-element linear elastic solver. We show that these applications can be implemented concisely and scale well with the number of GPUs—achieving more than 99% of ideal efficiency.
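
Neon's actual interface is not shown in the abstract, so the Python sketch below only illustrates, on the CPU, the general pattern such a runtime automates: splitting a grid into per-device partitions, exchanging one-cell halos between neighbors, and applying the same user stencil to every partition. It is a toy analogy with invented names, not Neon code.

```python
import numpy as np

def partitioned_stencil(grid, num_parts, iters):
    """Toy analogy of partitioned grid computation with halo exchange.

    Splits a 1-D grid into `num_parts` pieces (stand-ins for per-GPU
    partitions), exchanges one boundary cell with each neighbor every
    iteration, and applies a 3-point averaging stencil to each piece.
    """
    parts = np.array_split(np.asarray(grid, dtype=float), num_parts)
    for _ in range(iters):
        padded = []
        for i, p in enumerate(parts):
            left = parts[i - 1][-1] if i > 0 else p[0]                # halo from left neighbor
            right = parts[i + 1][0] if i < num_parts - 1 else p[-1]   # halo from right neighbor
            padded.append(np.concatenate(([left], p, [right])))
        # The "user code": one stencil applied uniformly to each partition.
        parts = [0.5 * (q[:-2] + q[2:]) for q in padded]
    return np.concatenate(parts)


print(partitioned_stencil(np.arange(8), num_parts=2, iters=1))
# [0.5 1.  2.  3.  4.  5.  6.  6.5]
```

In a real multi-GPU runtime the halo exchange and the overlap of communication with computation happen behind the scenes, which is the point of the hierarchical abstractions described above.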

RXMesh: A GPU Mesh Data Structure

(2021)

We propose a new static high-performance mesh data structure for triangle surface meshes on the GPU. Our data structure is carefully designed for parallel execution while capturing mesh locality and confining data access, as much as possible, within the GPU's fast shared memory. We achieve this by subdividing the mesh into patches and representing these patches compactly using a matrix-based representation. Our patching technique is decorated with ribbons, thin mesh strips around patches that eliminate the need to communicate between different computation thread blocks, resulting in consistent high throughput. We call our data structure RXMesh: Ribbon-matriX Mesh. We hide the complexity of our data structure behind a flexible but powerful programming model that helps deliver high performance by inducing load balance even in highly irregular input meshes. We show the efficacy of our programming model on common geometry processing applications—mesh smoothing and filtering, geodesic distance, and vertex normal computation. For evaluation, we benchmark our data structure against well-optimized GPU and (single and multi-core) CPU data structures and show significant speedups.
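
As a loose illustration of the patch-plus-ribbon idea, the sketch below groups triangle faces into patches and replicates, as "ribbon" entries, neighboring patches' faces that touch a patch's vertices, so per-patch work can proceed without consulting other patches. The dictionary layout is invented for illustration and is not RXMesh's matrix-based representation.

```python
from collections import defaultdict

def build_patches_with_ribbons(faces, patch_of_face):
    """Group triangle faces into patches and add one-ring 'ribbon' copies.

    faces:          list of (v0, v1, v2) vertex-index triples.
    patch_of_face:  patch_of_face[f] is the patch that owns faces[f].
    Returns {patch_id: {"owned": [...], "ribbon": [...]}}, where the ribbon
    holds copies of neighboring patches' faces that touch this patch's
    vertices, so per-patch computation needs no inter-patch lookups.
    """
    patches = defaultdict(lambda: {"owned": [], "ribbon": []})
    vertex_patches = defaultdict(set)          # vertex -> patches touching it
    for f, tri in enumerate(faces):
        patches[patch_of_face[f]]["owned"].append(tri)
        for v in tri:
            vertex_patches[v].add(patch_of_face[f])
    for f, tri in enumerate(faces):
        owner = patch_of_face[f]
        for p in set.union(*(vertex_patches[v] for v in tri)):
            if p != owner:
                patches[p]["ribbon"].append(tri)
    return dict(patches)


# Three triangles in a strip; the shared boundary becomes ribbon data.
faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
print(build_patches_with_ribbons(faces, patch_of_face=[0, 0, 1]))
```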

Dynamic Graphs on the GPU

(2020)

We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures.
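
A minimal CPU analogy of the "one hash table per vertex" layout, using Python dictionaries in place of the GPU hash tables; the concurrent GPU implementation and its memory management are not reproduced, and the class below is illustrative only.

```python
class DynamicGraph:
    """Adjacency stored as one hash table (here, a dict) per vertex."""

    def __init__(self):
        self.adj = {}                      # vertex -> {neighbor: edge value}

    def insert_vertex(self, v):
        self.adj.setdefault(v, {})

    def delete_vertex(self, v):
        self.adj.pop(v, None)              # drop outgoing edges
        for nbrs in self.adj.values():
            nbrs.pop(v, None)              # drop incoming edges

    def insert_edge(self, u, v, value=None):
        self.insert_vertex(u)
        self.insert_vertex(v)
        self.adj[u][v] = value

    def delete_edge(self, u, v):
        if u in self.adj:
            self.adj[u].pop(v, None)

    def has_edge(self, u, v):
        return u in self.adj and v in self.adj[u]

    def neighbors(self, u):
        return list(self.adj.get(u, {}))


g = DynamicGraph()
g.insert_edge(0, 1)
g.insert_edge(0, 2)
g.delete_edge(0, 1)
print(g.has_edge(0, 1), g.neighbors(0))    # False [2]
```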

Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection

(2019)

We benchmark several widely used deep learning frameworks and investigate FPGA deployment for traffic sign classification and detection. We evaluate the training speed and inference accuracy of these frameworks on the GPU by training FPGA-deployment-suitable models with various input sizes on GTSRB, a traffic sign classification dataset. Then, selected trained classification models, along with various object detection models that we train on GTSRB's detection counterpart (i.e., GTSDB), are evaluated for inference speed, accuracy, and FPGA power efficiency while varying parameters such as floating-point precision and batch size. We discover that Neon and MXNet deliver the best training speed and classification accuracy on the GPU in general for all test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We observe that with the current OpenVINO release, the performance of lightweight models (e.g., MobileNet-v1-SSD) usually exceeds the requirement of real-time detection without losing much accuracy, while other models (e.g., VGG-SSD, ResNet-50-SSD) generally fail to do so. We also demonstrate that we can adjust the precision of bitstreams and the batch sizes to balance the inference speed and accuracy of the applications deployed on the FPGA. Finally, we show that for all test cases, the FPGA always achieves higher power efficiency than the GPU.
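
The benchmarking harness itself is not spelled out in the abstract; as a generic, hypothetical illustration of the kind of sweep it reports (varying batch size and numeric precision, then recording latency and throughput), the Python helper below times a user-supplied inference callable. The function name and parameters are assumptions, not the paper's tooling.

```python
import itertools
import time

def sweep_inference(run_batch, batch_sizes, precisions, repeats=20):
    """Generic latency sweep over batch size and numeric precision.

    run_batch(batch_size, precision) is assumed to execute one inference
    batch in the requested configuration (framework- and device-specific
    code goes there).  Returns a list of result records.
    """
    results = []
    for bs, prec in itertools.product(batch_sizes, precisions):
        run_batch(bs, prec)                          # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            run_batch(bs, prec)
        latency = (time.perf_counter() - start) / repeats
        results.append({"batch_size": bs, "precision": prec,
                        "latency_s": latency, "images_per_s": bs / latency})
    return results


# e.g. sweep_inference(lambda bs, prec: time.sleep(0.001), [1, 8], ["FP32", "FP16"])
```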

Graph Coloring on the GPU

(2019)

We design and implement parallel graph coloring algorithms on the GPU using two different abstractions: one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speedup and a geomean speedup of 1.3x, and produces 1.6x more colors than previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby's algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speedup of 2.6x.
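
As a sequential Python sketch of the baseline independent-set coloring idea (in the spirit of Luby's algorithm), the code below repeatedly extracts an independent set of uncolored vertices whose random priorities beat those of all their uncolored neighbors and gives each set one color; the GPU-specific optimizations mentioned above (hashing, avoiding atomics, the max-min heuristic) are not modeled.

```python
import random

def independent_set_coloring(adj, seed=0):
    """Color a graph by repeatedly extracting independent sets.

    adj: dict vertex -> set of neighbors.  Returns dict vertex -> color.
    Each round, every uncolored vertex whose random priority exceeds those
    of all its uncolored neighbors joins the round's independent set and
    receives the round's color, so no two adjacent vertices share a color.
    """
    rng = random.Random(seed)
    priority = {v: rng.random() for v in adj}
    colors = {}
    color = 0
    while len(colors) < len(adj):
        selected = [v for v in adj if v not in colors and
                    all(priority[v] > priority[u]
                        for u in adj[v] if u not in colors)]
        for v in selected:
            colors[v] = color
        color += 1
    return colors


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(independent_set_coloring(adj))   # a valid coloring; this example needs 3 colors
```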

Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset

(2018)

We benchmark several widely used deep learning frameworks for performing deep-learning-related automotive tasks (e.g., traffic sign recognition) that need to achieve real-time, high-accuracy results with the limited resources available on embedded platforms such as FPGAs. In our benchmarks, we use various input image sizes on models that are suitable for FPGA deployment, and investigate the training speed and inference accuracy of selected frameworks for these different sizes on a popular traffic sign recognition dataset. We report results when running the frameworks solely on the CPU as well as with GPU acceleration enabled. We also describe the optimizations we apply to fine-tune the performance of the frameworks. We discover that Neon and MXNet deliver the best training speed and inference accuracy in general for all our test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We also observe that on the particular dataset we tested on (i.e., GTSRB), the image size of the region of interest does not necessarily affect the inference accuracy, and that using deep models, e.g., ResNet-32, which have longer training times, might not provide improvements to inference accuracy.