- Main

## Flexible Cross-Subspace Alignment Codes for Variable Coded Distributed Batch Matrix Multiplication

- Author(s): Tauz, Lev
- Advisor(s): Dolecek, Lara
- et al.

## Abstract

Modern distributed systems suffer from a phenomenon known as stragglers where computation nodes either break-down or are sufficiently slow, resulting in a large tail latency. Inspired by error correcting codes, researchers within the field of coded computation combat stragglers by cleverly encoding the data within the computations. One major endeavor is in the study of coded matrix-matrix multiplication where the task is to multiply two large matrices in a distributed manner. Most coded matrix computation work focuses on highly structured tasks which allows for easier code construction and analysis but limits the applicability for more general problems. For the first time, we consider the novel problem of multiplying many different matrices whose products may share matrices with no guaranteed regularity. Specifically, we consider the Variable Coded Distributed Batch Matrix Multiplication (VCDBMM) problem where the system is given two sets of matrices $\mathcal{A}=\{A_1,A_2,\dots,A_{|\mathcal{A}|}\}$ and $\mathcal{B}=\{B_1,B_2,\dots,B_{|\mathcal{B}|}\}$ and a set of computation goals $\mathcal{S} = \{(i_1,j_1),(i_2,j_2), \dots (i_{|\mathcal{S}|},j_{|\mathcal{S}|})\}$ and the objective is to calculate the matrix multiplication $A_iB_j$ for every $(i,j) \in \mathcal{S}$ in the presence of stragglers. Therefore, a good coding solution minimizes the recovery threshold (i.e., the number of workers that we need to wait for in order to compute the final output). Inspired by Cross-Subspace Alignment Codes, we construct Flexible Cross-Subspace Alignment Codes (FCSA) to solve the general VCDBMM problem. We provide two variants of FCSA codes that allow for a trade-off between the encoding/decoding complexity and the recovery threshold. We demonstrate that both variants are within a factor of two optimal under certain system constraints. We also generalize FCSA codes into Grouped FCSA codes where we group computations together to provide further flexibility between the computational complexity at the workers and the recovery threshold. We provide simulations on random instances of the VCDBMM problem and demonstrate the average improvement offered by our codes.