The cost of an algorithm includes both arithmetic and communication.
We use "communication" in a general sense to mean data movement,
either between levels of a memory hierarchy ("sequential") or
between processors ("parallel"). Communication costs include both
bandwidth terms, which are proportional to the number of words sent,
and latency terms, which are proportional to the number of messages in
which the data is sent. Communication costs are much higher than
arithmetic costs, and the gap is increasing rapidly for technological
reasons. This suggests that for best performance, algorithms should
minimize communication, even if that may require some redundant
arithmetic operations. We call such algorithms
"communication-avoiding."
Krylov subspace methods (KSMs) are iterative algorithms for solving
large, sparse linear systems and eigenvalue problems. Current KSMs
rely on sparse matrix-vector multiply (SpMV) and vector-vector
operations (like dot products and vector sums). All of these
operations are communication-bound. Furthermore, data dependencies
between them mean that only a small amount of that communication can
be hidden. Many important scientific and engineering computations
spend much of their time in Krylov methods, so the performance of many
codes could be improved by introducing KSMs that communicate less.
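To make these data dependencies concrete, the following is a minimal
Python (NumPy) sketch of one iteration of the unpreconditioned Method of
Conjugate Gradients (CG), a typical KSM; the function name and interface
are ours, for illustration only. Each iteration performs one SpMV and two
dot products, and neither dot product can begin until the SpMV it depends
on has finished, so little of the communication can be overlapped.

    import numpy as np

    def cg_step(A, x, r, p, rtr):
        """One iteration of unpreconditioned CG (illustration only).
        A   : sparse matrix (e.g., a scipy.sparse CSR matrix supporting A @ p)
        x   : current approximate solution
        r   : current residual b - A x
        p   : current search direction
        rtr : r.dot(r) carried over from the previous iteration
        """
        Ap = A @ p                  # SpMV: communication-bound (irregular memory
                                    # access, or neighbor exchanges in parallel)
        alpha = rtr / p.dot(Ap)     # dot product: global reduction; must wait for Ap
        x = x + alpha * p
        r = r - alpha * Ap
        rtr_new = r.dot(r)          # second global reduction, depends on the updated r
        beta = rtr_new / rtr
        p = r + beta * p
        return x, r, p, rtr_new

In a distributed-memory setting, each dot product is an all-reduce across
processors, so every iteration pays at least two synchronization latencies
in addition to the SpMV's communication.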
Our goal is to take s steps of a KSM for the same communication cost
as 1 step, which would be optimal. We call the resulting KSMs
"communication-avoiding Krylov methods." We can do this under certain
assumptions on the matrix, and for certain KSMs, both in theory (for
many KSMs) and in practice (for some KSMs). Our algorithms are based
on the so-called "s-step" Krylov methods, which break up the data
dependency between the sparse matrix-vector multiply and the dot
products in standard Krylov methods. This idea has been around for a
while, but in contrast to prior work (discussed in detail in Section
1.5), this thesis makes the following contributions:
* We have fast kernels that replace SpMV and can compute the
results of s SpMV calls for the same communication cost as one
call (Section 2.1). (A sketch of what such a kernel computes appears
after this list.)
* We have fast dense kernels as well, such as Tall Skinny QR (TSQR
-- Section 2.3) and Block Gram-Schmidt (BGS -- Section 2.4), which
can do the work of Modified Gram-Schmidt applied to s vectors with a
factor of Theta(s^2) fewer messages in parallel, and a factor of
Theta(s/W) fewer words transferred between levels of the memory
hierarchy (where W is the fast memory capacity in words). (A sketch of
TSQR's combining step appears after this list.)
* We have new communication-avoiding Block Gram-Schmidt algorithms
for orthogonalization in more general inner products (Section 2.5).
* We have new communication-avoiding versions of the following
Krylov subspace methods for solving linear systems: the Generalized
Minimum Residual method (GMRES -- Section 3.4), both
unpreconditioned and preconditioned, and the Method of Conjugate
Gradients (CG), both unpreconditioned (Section 5.4) and
left-preconditioned (Section 5.5).
* We have new communication-avoiding versions of the following Krylov
subspace methods for solving eigenvalue problems, both standard (Ax =
λx, for a nonsingular matrix A) and "generalized" (Ax =
λMx, for nonsingular matrices A and M): Arnoldi iteration
(Section 3.3), and Lanczos iteration, both for Ax = λx (Section
4.2) and Ax = λMx (Section 4.3).
* We propose techniques for developing communication-avoiding versions
of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue
problems Ax = λx) and the Method of Biconjugate Gradients
(BiCG) for solving linear systems. See Chapter 6 for details.
* We can combine more stable numerical formulations, which use different
bases of the Krylov subspaces, with our techniques for avoiding
communication. For a discussion of different bases, see Chapter 7.
To see an example of how the choice of basis affects the formulation
of the Krylov method, see Section 3.2.2. (A sketch of one alternative
basis, the Newton basis, appears after this list.)
* We have faster numerical formulations. For example, in our
communication-avoiding version of GMRES, CA-GMRES (see Section 3.4),
we can pick the restart length r independently of the s-step basis
length s. Experiments in Section 3.5.5 show that this ability
improves numerical stability. We show in Section 3.6.3 that it also
improves performance in practice, resulting in a 2.23× speedup
in the CA-GMRES implementation described below.
* We combine all of these numerical and performance techniques in a
shared-memory parallel implementation of our communication-avoiding
version of GMRES, CA-GMRES. Compared to a similarly highly optimized
version of standard GMRES, when both are running in parallel on 8
cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves
4.3× speedups over standard GMRES on standard sparse test matrices
(described in Appendix B.5). When both are running in parallel on 8
cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves
4.1× speedups. See Section 3.6 for performance results and Section
3.5 for corresponding numerical experiments. We first reported
performance results for this implementation on the Intel Clovertown
platform in Demmel et al. [78].
* We have incorporated preconditioning into our methods. Note
that we have not yet developed practical communication-avoiding
preconditioners; this is future work. We have
accomplished the following:
  * We show (in Sections 2.2 and 4.3) what the s-step basis should
  compute in the preconditioned case for many different types of Krylov
  methods and s-step bases. We explain why this is hard in Section 4.3.
  * We have identified two different structures that a preconditioner
  may have in order to achieve the desired optimal reduction of
  communication by a factor of s. See Section 2.2 for details.
* We present a detailed survey of related work, including s-step KSMs
(Section 1.5, especially Table 1.1) and other techniques for reducing
the amount of communication in iterative methods (Section 1.6).
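To illustrate the first contribution above: in the simplest (monomial)
case, the kernel of Section 2.1 computes the vectors Av, A^2 v, ..., A^s v,
that is, the results of s successive SpMV calls starting from v. The
following minimal NumPy sketch shows only that mathematical result,
computed naively with one full pass over A per column; it is not the
communication-avoiding algorithm itself, which blocks the computation
using the sparsity structure of A so that A is read (and, in parallel,
remote entries of v are communicated) only once. Names are illustrative.

    import numpy as np

    def monomial_basis(A, v, s):
        """Return V with columns [v, A v, A^2 v, ..., A^s v] (illustration only).
        Computed here with s separate SpMVs; the matrix powers kernel of
        Section 2.1 produces the same columns while reading A only once,
        under assumptions on the sparsity structure of A.
        """
        n = v.shape[0]
        V = np.empty((n, s + 1))
        V[:, 0] = v
        for k in range(s):
            V[:, k + 1] = A @ V[:, k]   # naive: one full pass over A per column
        return V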
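To illustrate the TSQR kernel (Section 2.3): TSQR orthogonalizes a tall,
skinny matrix by factoring row blocks independently and then combining the
small R factors. The following minimal NumPy sketch shows a single
two-block combining step (assuming each block has at least as many rows as
columns); a full TSQR applies such steps over a reduction tree, or
sequentially over blocks, and keeps the Q factor in implicit form. Names
are illustrative.

    import numpy as np

    def tsqr_two_blocks(V):
        """One combining step of TSQR for a tall skinny matrix V (illustration only).
        Splits V into two row blocks, factors each independently (in parallel,
        these would live on different processors), then factors the stacked
        R factors. Returns the n-by-n R factor of V; Q is not formed here.
        """
        m, n = V.shape
        half = m // 2
        _, R0 = np.linalg.qr(V[:half, :])          # local QR on block 0
        _, R1 = np.linalg.qr(V[half:, :])          # local QR on block 1 (independent)
        _, R = np.linalg.qr(np.vstack([R0, R1]))   # combine: QR of the small 2n-by-n stack
        return R

Only the small n-by-n R factors move between processors (or between levels
of the memory hierarchy), which is the source of the reduction in messages
and words cited above.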
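To illustrate the basis choices discussed in Chapter 7: one commonly used
alternative to the monomial basis in the s-step literature is the Newton
basis, which replaces the recurrence v_{k+1} = A v_k with v_{k+1} =
(A - θ_k I) v_k for shifts θ_k (for example, Leja-ordered Ritz value
estimates). The minimal NumPy sketch below generates such a basis naively,
one shifted SpMV per column; the shifts are taken as given, the names are
illustrative, and the matrix powers kernel can also generate shifted bases
of this form.

    import numpy as np

    def newton_basis(A, v, shifts):
        """Return V with columns v, (A - t0 I)v, (A - t1 I)(A - t0 I)v, ...
        (illustration only).
        A      : sparse matrix supporting A @ x
        v      : starting vector
        shifts : sequence of s shifts t0, ..., t_{s-1}
        For larger s this basis is typically far better conditioned than
        the monomial basis [v, A v, ..., A^s v].
        """
        n = v.shape[0]
        V = np.empty((n, len(shifts) + 1))
        V[:, 0] = v
        for k, theta in enumerate(shifts):
            V[:, k + 1] = A @ V[:, k] - theta * V[:, k]   # one shifted SpMV per column
        return V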