The graphics processing unit (GPU) was initially designed for raster-based graphics computations, but marked improvements in performance and programmability have generated considerable interest in it as a high-performance computing platform. GPU hardware design places a large number of low-power cores on a single chip, as opposed to multiple high-power chips. This configuration has a powerful effect on locality: data can be processed without the throughput costs of spilling it to buffers in the higher levels of the memory hierarchy. However, throughput-oriented algorithms must now express and optimize memory access explicitly.
This dissertation explores the design and deployment of memory-efficient tensor product representations for throughput-oriented processors. These representations lead naturally to algorithms with excellent data locality and can be used to construct efficient and scalable representations for hierarchical tensor-based algorithms. In addition to presenting the underlying mathematical framework of tensor product representations, we present applications to several important numerical algorithms.
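As a schematic illustration of the idea (the notation here is illustrative and not drawn from the body of the dissertation), a tensor product, or separable, representation approximates a two-variable function by a short sum of products of one-variable factors,
\[
  K(x, y) \approx \sum_{l=1}^{r} u_l(x)\, v_l(y),
\]
so that storage and data movement are governed by the separation rank \(r\) and a small set of one-dimensional factors rather than a full two-dimensional array, which is the kind of compactness that makes such representations attractive on throughput-oriented hardware.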
Our first contribution is a memory-efficient parallel Inverse Discrete Wavelet Transform algorithm. We present test results from an implementation that performs extremely well on commodity desktop GPU hardware. The small memory footprint of our representation contributes to an efficient data decompression algorithm and, ultimately, to high-quality image analysis software.
We also consider the general problems associated with the efficient deployment of procedures that use tensor product representations. We propose an abstract approach to heterogeneous cluster application programming that combines throughput-oriented algorithmic primitives with an integrated work-partitioning mechanism. We then present an experimental OpenCL-based implementation and test it on a two-dimensional N-body simulation using the Barnes-Hut algorithm. Our implementation demonstrates good performance on a heterogeneous cluster of GPU-equipped machines.
Finally, we present a general framework for creating tensor product representations of integral kernels in d dimensions. We conduct a detailed analysis of the memory complexity of the Barnes-Hut and Fast Multipole Methods in the general d-dimensional case, again using tensor product representations as primitives. We then present performance benchmarks for the two algorithms and verify our theoretical results against experimentally measured values.
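For concreteness, a hedged sketch of the general form such a representation might take (the exact construction and notation in the dissertation may differ) approximates a d-dimensional kernel by a short sum of separable terms,
\[
  K(x_1, \dots, x_d) \approx \sum_{l=1}^{r} \prod_{k=1}^{d} u^{(k)}_l(x_k),
\]
so that the number of stored one-dimensional factors grows roughly like \(r d\) rather than exponentially in \(d\), which is the kind of memory saving that motivates using such representations as algorithmic primitives.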