Running programs across multiple nodes in a cluster of networked computers, such as a supercomputer or commodity datacenter system, is increasingly important across multiple domains, including data science, machine learning, and scientific computing. This trend is driven by a combination of growing data sizes, which push beyond the memory capacity of a single node, and increasing computational demands from new, more elaborate simulations, models, and applications.
However, writing parallel programs for clusters of computers remains a difficult task, particularly for programs that are irregular in their data distribution or access patterns. Many parallel programs today are still written using communication libraries like MPI or OpenSHMEM, which require users to explicitly manage low-level communication details. While high-level parallel programming languages and libraries do exist, and they can make implementing certain types of programs much easier, developers often must expend significant effort building custom infrastructure and data structures for their applications.
This thesis argues that a large part of the reason parallel programming remains difficult is the lack of high-level distributed data structures analogous to those that have become ubiquitous in sequential programming environments like C++ and Python. This is especially true of irregular data structures, such as hash tables and queues, that require fine-grained memory accesses and synchronization. This thesis examines techniques for building high-level, cross-platform distributed data structures using one-sided remote memory operations such as remote put, remote get, and remote atomics. These primitives permit a high degree of asynchrony, improving performance by removing synchronization bottlenecks and allowing communication to be overlapped with computation. They can also be executed directly by the network hardware in modern supercomputer and commodity datacenter networks, eliminating the need to synchronize with remote processes.
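To make this programming model concrete, the sketch below illustrates the one-sided style these primitives enable, using MPI's RMA interface (remote put, remote get, and remote compare-and-swap) as a stand-in. The window layout, slot encoding, and constants are hypothetical and chosen only for illustration; they are not taken from any particular data structure described in this thesis.

```cpp
// Minimal sketch of one-sided remote memory operations using MPI RMA.
// Each rank exposes an array of 64-bit slots; rank 0 claims a slot on a
// remote rank with a remote compare-and-swap, writes a value with a remote
// put, and reads it back with a remote get, all without the target
// process's CPU having to post matching calls.
#include <mpi.h>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int num_slots = 16;
  int64_t* slots;
  MPI_Win win;
  // Allocate a window of slots on every rank; 0 marks an empty slot.
  MPI_Win_allocate(num_slots * sizeof(int64_t), sizeof(int64_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &slots, &win);
  for (int i = 0; i < num_slots; i++) slots[i] = 0;
  MPI_Barrier(MPI_COMM_WORLD);

  // Open a passive-target epoch: no matching epoch calls on the target.
  MPI_Win_lock_all(0, win);

  if (rank == 0 && nprocs > 1) {
    int target = 1, slot = 3;
    int64_t empty = 0, claim = 1, previous;
    // Remote atomic: attempt to claim slot 3 on the target rank (0 -> 1).
    MPI_Compare_and_swap(&claim, &empty, &previous, MPI_INT64_T,
                         target, slot, win);
    MPI_Win_flush(target, win);
    if (previous == empty) {
      // Remote put: store a payload into the claimed slot.
      int64_t value = 42;
      MPI_Put(&value, 1, MPI_INT64_T, target, slot, 1, MPI_INT64_T, win);
      MPI_Win_flush(target, win);
    }
    // Remote get: read the slot back to verify.
    int64_t check;
    MPI_Get(&check, 1, MPI_INT64_T, target, slot, 1, MPI_INT64_T, win);
    MPI_Win_flush(target, win);
    std::printf("slot %d on rank %d now holds %lld\n",
                slot, target, (long long)check);
  }

  MPI_Win_unlock_all(win);
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```

Higher-level distributed data structures hide this bookkeeping behind method calls, but the underlying network traffic remains a small number of puts, gets, and atomics per operation, which is what makes a per-operation cost model natural.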
This thesis examines several RDMA-based distributed data structures, including hash tables, Bloom filters, queues, and dense and sparse matrices. We provide a performance model for evaluating the cost of RDMA-based distributed data structure methods in terms of their component remote memory operations, and demonstrate how this model can be extended to support GPUs in addition to conventional CPUs.
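As a rough illustration of what such a model can look like (the symbols and the example operation counts below are illustrative assumptions, not the thesis's actual parameters), the cost of a data structure method can be written as a weighted sum of the remote operations it issues:

\[
  T_{\text{method}} \approx n_{\text{put}} L_{\text{put}} + n_{\text{get}} L_{\text{get}} + n_{\text{atomic}} L_{\text{atomic}},
\]

where $n_{\text{put}}$, $n_{\text{get}}$, and $n_{\text{atomic}}$ count the remote puts, gets, and atomics the method issues, and $L_{\text{put}}$, $L_{\text{get}}$, and $L_{\text{atomic}}$ are measured per-operation latencies on a given platform. Under this style of model, for example, a hash table insertion that performs one remote compare-and-swap to claim a bucket followed by one remote put of the key-value pair would cost roughly $L_{\text{atomic}} + L_{\text{put}}$.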