## Communication-avoiding Krylov subspace methods

- Author(s): Hoemmen, Mark
- Advisor(s): Demmel, James W
- et al.

## Abstract

The cost of an algorithm includes both arithmetic and communication.

We use "communication" in a general sense to mean data movement,

either between levels of a memory hierarchy ("sequential") or

between processors ("parallel"). Communication costs include both

bandwidth terms, which are proportional to the number of words sent,

and latency terms, which are proportional to the number of messages in

which the data is sent. Communication costs are much higher than

arithmetic costs, and the gap is increasing rapidly for technological

reasons. This suggests that for best performance, algorithms should

minimize communication, even if that may require some redundant

arithmetic operations. We call such algorithms

"communication-avoiding."

Krylov subspace methods (KSMs) are iterative algorithms for solving

large, sparse linear systems and eigenvalue problems. Current KSMs

rely on sparse matrix-vector multiply (SpMV) and vector-vector

operations (like dot products and vector sums). All of these

operations are communication-bound. Furthermore, data dependencies

between them mean that only a small amount of that communication can

be hidden. Many important scientific and engineering computations

spend much of their time in Krylov methods, so the performance of many

codes could be improved by introducing KSMs that communicate less.

Our goal is to take s steps of a KSM for the same communication cost

as 1 step, which would be optimal. We call the resulting KSMs

"communication-avoiding Krylov methods." We can do this under certain

assumptions on the matrix, and for certain KSMs, both in theory (for

many KSMs) and in practice (for some KSMs). Our algorithms are based

on the so-called "s-step" Krylov methods, which break up the data

dependency between the sparse matrix-vector multiply and the dot

products in standard Krylov methods. This idea has been around for a

while, but in contrast to prior work (discussed in detail in Section

1.5), this thesis makes the following contributions:

We have fast kernels replacing SpMV, that can compute the

results of s calls to SpMV for the same communication cost as one

call (Section 2.1).

We have fast dense kernels as well, such as Tall Skinny QR (TSQR

-- Section 2.3) and Block Gram-Schmidt (BGS -- Section 2.4), which

can do the work of Modified Gram-Schmidt applied to s vectors for a

factor of Theta(s^2) fewer messages in parallel, and a factor of

Theta(s/W) fewer words transferred between levels of the memory

hierarchy (where W is the fast memory capacity in words).

We have new communication-avoiding Block Gram-Schmidt algorithms

for orthogonalization in more general inner products (Section 2.5).

We have new communication-avoiding versions of the following

Krylov subspace methods for solving linear systems: the Generalized

Minimum Residual method (GMRES -- Section 3.4), both

unpreconditioned and preconditioned, and the Method of Conjugate

Gradients (CG), both unpreconditioned (Section 5.4) and

left-preconditioned (Section 5.5).

We have new communication-avoiding versions of the following Krylov

subspace methods for solving eigenvalue problems, both standard (Ax =

&lambda x, for a nonsingular matrix A) and "generalized" (Ax =

&lambda Mx, for nonsingular matrices A and M): Arnoldi iteration

(Section 3.3), and Lanczos iteration, both for Ax = &lambda x (Section

4.2) and Ax = &lambda M x (Section 4.3).

We propose techniques for developing communication-avoiding versions

of nonsymmetric Lanczos iteration (for solving nonsymmetric eigenvalue

problems Ax = &lambda x) and the Method of Biconjugate Gradients

(BiCG) for solving linear systems. See Chapter 6 for details.

We can combine more stable numerical formulations that use different

bases of Krylov subspaces with our techniques for avoiding

communication. For a discussion of different bases, see Chapter 7.

To see an example of how the choice of basis affects the formulation

of the Krylov method, see Section 3.2.2.

We have faster numerical formulations. For example, in our

communication-avoiding version of GMRES, CA-GMRES (see Section 3.4),

we can pick the restart length r independently of the s-step basis

length s. Experiments in Section 3.5.5 show that this ability

improves numerical stability. We show in Section 3.6.3 that it also

improves performance in practice, resulting in a 2.23 × speedup

in the CA-GMRES implementation described below.

We combine all of these numerical and performance techniques in a

shared-memory parallel implementation of our communication-avoiding

version of GMRES, CA-GMRES. Compared to a similarly highly optimized

version of standard GMRES, when both are running in parallel on 8

cores of an Intel Clovertown (see Appendix A), CA-GMRES achieves 4.3

× speedups over standard GMRES on standard sparse test matrices

(described in Appendix B.5). When both are running in parallel on 8

cores of an Intel Nehalem (see Appendix A), CA-GMRES achieves 4.1

× speedups. See Section 3.6 for performance results and Section

3.5 for corresponding numerical experiments. We first reported

performance results for this implementation on the Intel Clovertown

platform in Demmel et al. [78].

We have incorporated preconditioning into our methods. Note

that we have not yet developed practical communication-avoiding

preconditioners; this is future work. We *have*

accomplished the following:

* We show (in Sections 2.2 and 4.3) what the s-step basis should

compute in the preconditioned case for many different types of Krylov

methods and s-step bases. We explain why this is hard in Section 4.3.

* We have identified two different structures that a preconditioner

may have, in order to achieve the desired optimal reduction of

communication by a factor of s. See Section 2.2 for details.

We present a detailed survey of related work, including s-step KSMs

(Section 1.5, especially Table 1.1) and other techniques for reducing

the amount of communication in iterative methods (Section 1.6).