Avoiding Communication in Dense Linear Algebra
Dense linear algebra computations are essential to nearly every problem in scientific computing and to countless other fields. Most matrix computations enjoy a high computational intensity (i.e., ratio of computation to data), and therefore the algorithms for the computations have a potential for high efficiency. However, performance for many linear algebra algorithms is limited by the cost of moving data between processors on a parallel computer or throughout the memory hierarchy of a single processor, which we will refer to generally as communication. Technological trends indicate that algorithmic performance will become even more limited by communication in the future. In this thesis, we consider the fundamental computations within dense linear algebra and address the following question: can we significantly improve the current algorithms for these computations, in terms of the communication they require and their performance in practice?
To answer the question, we analyze algorithms on sequential and parallel architectural models that are simple enough to determine coarse communication costs but accurate enough to predict performance of implementations on real hardware. For most of the computations, we prove lower bounds on the communication that any algorithm must perform. If an algorithm exists with communication costs that match the lower bounds (at least in an asymptotic sense), we call the algorithm communication optimal. In many cases, the most commonly used algorithms are not communication optimal, and we can develop new algorithms that require less data movement and attain the communication lower bounds.
In this thesis, we develop both new communication lower bounds and new algorithms, tightening (and in many cases closing) the gap between best known lower bound and best known algorithm (or upper bound). We consider both sequential and parallel algorithms, and we asses both classical and fast algorithms (e.g., Strassen's matrix multiplication algorithm).
In particular, the central contributions of this thesis are: proving new communication lower bounds for nearly all classical direct linear algebra computations (dense or sparse), including factorizations for solving linear systems, least squares problems, and eigenvalue and singular value problems; proving new communication lower bounds for Strassen's and other fast matrix multiplication algorithms; proving new parallel communication lower bounds for classical and fast computations that set limits on an algorithm's ability to perfectly strong scale; summarizing the state-of-the-art in communication efficiency for both sequential and parallel algorithms for the computations to which the lower bounds apply; developing a new communication-optimal algorithm for computing a symmetric-indefinite factorization (observing speedups of up to 2.8x compared to alternative shared-memory parallel algorithms); developing new, more communication-efficient algorithms for reducing a symmetric band matrix to tridiagonal form via orthogonal similar transformations (observing speedups of 2--6x compared to alternative sequential and parallel algorithms); and developing a new communication-optimal parallelization of Strassen's matrix multiplication algorithm (observing speedups of up to 2.84x compared to alternative distributed-memory parallel algorithms).