 Main
Provably Efficient Algorithms for Numerical Tensor Algebra
 Author(s): Solomonik, Edgar
 Advisor(s): Demmel, James
 et al.
Abstract
This thesis targets the design of parallelizable algorithms and communicationefficient parallel schedules for numerical linear algebra as well as computations with higherorder tensors. Communication is a growing bottleneck in the execution of most algorithms on parallel computers, which manifests itself as data movement both through the network connecting different processors and through the memory hierarchy of each processor as well as synchronization between processors. We provide a rigorous theoretical model of communication and derive lower bounds as well as algorithms in this model. Our analysis concerns two broad areas of linear algebra and of tensor contractions. We demonstrate the practical quality of the new theoreticallyimproved algorithms by presenting results which show that our implementations outperform standard libraries and traditional algorithms.
We model the costs associated with local computation, interprocessor communication and synchronization, as well as memory to cache data transfers of a parallel schedule based on the most expensive execution path in the schedule. We introduce a new technique for deriving lower bounds on tradeoffs between these costs and apply them to algorithms in both dense and sparse linear algebra as well as graph algorithms. These lower bounds are attained by what we refer to as 2.5D algorithms, which we give for matrix multiplication, Gaussian elimination, QR factorization, the symmetric eigenvalue problem, and the FloydWarshall allpairs shortestpaths algorithm. 2.5D algorithms achieve lower interprocessor bandwidth cost by exploiting auxiliary memory. Algorithms employing this technique are well known for matrix multiplication, and have been derived in the BSP model for LU and QR factorization, as well as the FloydWarshall algorithm. We introduce alternate versions of LU and QR algorithms which have measurable performance improvements over their BSP counterparts, and we give the first evaluations of their performance. We also explore networktopologyaware mapping on torus networks for matrix multiplication and LU, showing how 2.5D algorithms can efficiently exploit collective communication, as well as introducing an adaptation of Cannon's matrix multiplication algorithm that is better suited for torus networks with dimension larger than two. For the symmetric eigenvalue problem, we give the first 2.5D algorithms, additionally solving challenges with memorybandwidth efficiency that arise for this problem. We also give a new memorybandwidth efficient algorithm for Krylov subspace methods (repeated multiplication of a vector by a sparsematrix), which is motivated by the application of our lower bound techniques to this problem.
The latter half of the thesis contains algorithms for higherorder tensors, in particular tensor contractions. The motivating application for this work is the family of coupledcluster methods, which solve the manybody Schrödinger equation to provide a chemicallyaccurate model of the electronic structure of molecules and chemical reactions where electron correlation plays a significant role. The numerical computation of these methods is dominated in cost by contraction of antisymmetric tensors. We introduce Cyclops Tensor Framework, which provides an automated mechanism for networktopologyaware decomposition and redistribution of tensor data. It leverages 2.5D matrix multiplication to perform tensor contractions communicationefficiently. The framework is capable of exploiting symmetry and antisymmetry in tensors and utilizes a distributed packedsymmetric storage format. Finally, we consider a theoretically novel technique for exploiting tensor symmetry to lower the number of multiplications necessary to perform a contraction via computing some redundant terms that allow preservation of symmetry and then cancelling them out with loworder cost. We analyze the numerical stability and communication efficiency of this technique and give adaptations to antisymmetric and Hermitian matrices. This technique has promising potential for accelerating coupledcluster methods both in terms of computation and communication cost, and additionally provides a potential improvement for BLAS routines on complex matrices.
Main Content
Enter the password to open this PDF file:













