Optimizing Irregular Data Accesses for Cluster and Multicore Architectures
- Author(s): Su, Jimmy Zhigang
- Advisor(s): Yelick, Katherine A., et al.
Applications with irregular accesses to shared state are one of the most challenging computational patterns in parallel computing. Accesses can involve either read or write operations, with writes having the additional complexity of requiring some form of synchronization. Irregular accesses perform poorly both in local cache-based memory systems and across networks in global distributed memory settings, because they have poor spatial and temporal locality. Irregular accesses arise in transaction processing, in various system-level programs, in computing histograms, in performing sparse matrix operations, in updating meshes in particle-mesh methods, and in building adaptive unstructured meshes. Writing codes with asynchronous parallel updates presents different sets of challenges on clusters and on multicore processors. On clusters, the goal is to minimize both the number and the volume of messages between nodes. On multicore machines, by contrast, the goal is to minimize off-chip accesses, since there is a significant performance difference between on-chip and off-chip memory access.
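The cluster-side goal of reducing message count can be illustrated with a minimal sketch (in Python, purely for illustration; the bin layout and function name are assumptions, not the dissertation's implementation): irregular histogram updates destined for remote nodes are grouped by destination, so each node receives one batched message rather than one message per update.

```python
from collections import defaultdict

def aggregate_updates(indices, bins_per_node):
    """Group irregular histogram updates by destination node.

    Assumes a hypothetical block distribution: global bin i lives on
    node i // bins_per_node at local offset i % bins_per_node.  Each
    entry in the returned dict corresponds to one batched message.
    """
    outgoing = defaultdict(list)  # destination node -> local bin offsets
    for i in indices:
        outgoing[i // bins_per_node].append(i % bins_per_node)
    return outgoing

# Six scattered updates collapse into four messages (one per node touched).
batches = aggregate_updates([7, 3, 7, 12, 3, 9], bins_per_node=4)
```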
In this dissertation, we explore various analyses, optimizations, and tools for shared accesses on both multicore and distributed memory cluster architectures. On cluster architectures, we consider both irregular reads and writes, demonstrate how Partitioned Global Address Space languages support programming irregular problems, and develop optimizations to minimize
communication traffic, both in volume and number of distinct events. On multicore processors, we consider the lower level code generation and tuning problem, independent of any particular source language. We explore performance tradeoffs between various shared update implementations, such as locking, replication of state to avoid collisions, and hybrid versions.
We develop an adaptive implementation that adjusts the shared update strategy based on update density, yielding significant speedups. In addition, we develop a performance debugging tool to find scalability problems in large scientific applications early in the development cycle. Throughout the thesis we perform experiments demonstrating the value of our optimizations and tools in both architectural settings, using a set of benchmarks and applications that includes histogram making, sparse matrix computations, and two scientific simulations involving particle-mesh methods. Our results show substantial speedups of up to 4.8X on multicore platforms and 120X on clusters. The result is a comprehensive set of techniques for improving the performance of irregular applications using advanced languages, compilers, analyses, optimizations, and tools.