Contemporary parallel microprocessors exploit Chip Multiprocessing along with Single Instruction, Multiple Data parallelism to deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level efficiency languages such as C++ is often a difficult task, since the programmer is burdened with mapping data parallelism from an application onto the hardware structures designed to execute it. Languages specifically designed for data parallelism, such as NESL, aim to improve programmer productivity by allowing the programmer to express computation as a composition of data parallel primitives, such as map, reduce and scan. However, efficiently compiling nested data parallel computations to contemporary parallel microprocessors is challenging.
Additionally, the rise of productivity languages, such as Ruby and Python, has facilitated the construction of domain-specific embedded languages. These embedded languages employ familiar syntactic constructs, which eases the task of learning a new programming environment, while also retaining the full capabilities of the host language. This work capitalizes on the productivity of domain-specific embedded languages as well as the nested data parallel abstraction to create a programming environment that is both productive and efficient on contemporary parallel microprocessors. We describe Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer writes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis.
Compiling data parallel computations requires analyses and transformations to schedule data parallel operations onto a target platform. We discuss the language, compiler, and runtime features that enable Copperhead to efficiently do so. We define the restricted subset of Python that Copperhead supports and introduce the program analysis techniques and transformations necessary for compiling Copperhead code into efficient low-level implementations. We demonstrate that indiscriminate use of the flattening or vectorization transform, common to data parallel compilers, is inefficient on contemporary microprocessors, and we advocate a more direct mapping of nested data parallel operations onto the natural parallelism hierarchy provided by today's parallel platforms. We show how this direct mapping allows a data parallel compiler to capitalize on hierarchical on-chip memory structures, as well as perform data parallel primitive fusion in order to gain efficiency. We outline the runtime support that allows Copperhead programs to efficiently interoperate with standard Python modules.
We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on graphics processors. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient low-level implementations, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code. Copperhead provides an efficient and productive way to program parallel microprocessors.