Automatically Tuning Collective Communication for One-Sided Programming Models
- Author(s): Nishtala, Rajesh
- Advisor(s): Yelick, Katherine A.; et al.
Technology trends suggest that future machines will rely
on parallelism to meet increasing performance requirements. To aid in programmer productivity and application performance, many
parallel programming models provide communication building
blocks called collective communication. These
operations, such as Broadcast, Scatter, Gather, and Reduce, abstract
common global data movement patterns behind a simple library
interface, allowing the hardware and runtime system to optimize them for performance and scalability.
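The idea can be made concrete with a small sketch. The code below uses the familiar MPI collective interface purely as a stand-in (the dissertation itself targets the GASNet/PGAS collectives); the point is that a single library call such as Broadcast or Reduce hides the underlying data-movement pattern, leaving the runtime free to choose trees, pipelines, or hardware support.

```c
/* Illustrative only: MPI used as a stand-in for a generic collectives
 * interface.  The root broadcasts a problem size and all processes
 * combine partial sums with a Reduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = (rank == 0) ? 1000 : 0;                  /* only the root knows n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* Broadcast */

    long local = rank * (long)n;                     /* per-process partial result */
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);  /* Reduce */

    if (rank == 0) printf("n = %d, total = %ld\n", n, total);
    MPI_Finalize();
    return 0;
}
```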
We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional
shared memory programming models, they deliver the benefits of
sophisticated distributed data structures using language extensions
and one-sided communication.
One-sided communication allows one processor to directly read and
write memory associated with another.
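A minimal UPC sketch (UPC being one of the PGAS languages that runs over GASNet) illustrates the one-sided style: a thread reads an element that has affinity to another thread with an ordinary expression, and the owning thread executes no matching receive.

```c
/* Minimal UPC sketch of one-sided access: each thread writes its own
 * element of a shared array, then reads its neighbor's element.  The
 * remote read compiles down to a one-sided get. */
#include <upc.h>
#include <stdio.h>

shared int data[THREADS];        /* one element with affinity to each thread */

int main(void) {
    data[MYTHREAD] = 100 + MYTHREAD;         /* local write */
    upc_barrier;                             /* make all writes visible */

    int neighbor = (MYTHREAD + 1) % THREADS;
    int v = data[neighbor];                  /* one-sided remote read */
    printf("thread %d read %d from thread %d\n", MYTHREAD, v, neighbor);
    return 0;
}
```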
Many popular PGAS language implementations share a common runtime
system called GASNet for implementing such communication. To provide a highly scalable
platform for our work, we present a new implementation of
GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.
We demonstrate that PGAS languages are highly scalable and that
the one-sided communication within them is an efficient and
convenient platform for collective communication. We show how to use one-sided communication to achieve 3x improvements in the latency and
throughput of the collectives over standard message passing implementations.
Using a 3D FFT as a representative communication-bound
benchmark, for example, we see a 17% increase in performance on 32,768
cores of the BlueGene/P and a 1.5x improvement on 1,024 cores of the Cray XT4. We also show how the
automatically tuned collectives can deliver more than an order of
magnitude improvement in performance over existing implementations on shared memory platforms.
There is no obvious best algorithm that serves all machines and usage patterns,
demonstrating the need for tuning. We thus build
an automatic tuning system in GASNet
that optimizes the collectives for a variety of large-scale
supercomputers and novel multicore architectures. To understand the large search space, we
construct analytic performance models and use them to minimize the
overhead of autotuning. We demonstrate that autotuning is
an effective approach to addressing performance optimizations on complex parallel systems.
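The interplay of model-guided pruning and empirical search can be sketched as follows. This is not the actual GASNet tuner; the latency/bandwidth constants, process count, and candidate radices are illustrative assumptions. A simple analytic model ranks candidate broadcast tree radices, and a real tuner would then benchmark only the most promising candidates, keeping the autotuning overhead low.

```c
/* Hedged sketch of model-guided autotuning (illustrative constants).
 * Predicted time for a k-ary broadcast tree reaching P processes:
 * ceil(log_k P) levels, each parent sending to k children in turn. */
#include <math.h>
#include <stdio.h>

static double predict_kary_bcast(int P, int k, size_t nbytes,
                                 double alpha, double beta) {
    double levels = ceil(log((double)P) / log((double)k));
    return levels * k * (alpha + beta * (double)nbytes);
}

int main(void) {
    const int    P      = 32768;     /* assumed process count */
    const size_t nbytes = 8;         /* small-message broadcast */
    const double alpha  = 3.0e-6;    /* assumed per-message latency (s) */
    const double beta   = 3.0e-9;    /* assumed per-byte cost (s/byte) */

    int best_k = 2;
    double best_t = predict_kary_bcast(P, 2, nbytes, alpha, beta);
    for (int k = 3; k <= 16; k++) {
        double t = predict_kary_bcast(P, k, nbytes, alpha, beta);
        if (t < best_t) { best_t = t; best_k = k; }
    }
    printf("model prefers radix %d (predicted %.2f us); "
           "a tuner would now time only the top-ranked candidates\n",
           best_k, best_t * 1e6);
    return 0;
}
```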