 Main
Flexible CrossSubspace Alignment Codes for Variable Coded Distributed Batch Matrix Multiplication
 Tauz, Lev
 Advisor(s): Dolecek, Lara
Abstract
Modern distributed systems suffer from a phenomenon known as stragglers where computation nodes either breakdown or are sufficiently slow, resulting in a large tail latency. Inspired by error correcting codes, researchers within the field of coded computation combat stragglers by cleverly encoding the data within the computations. One major endeavor is in the study of coded matrixmatrix multiplication where the task is to multiply two large matrices in a distributed manner. Most coded matrix computation work focuses on highly structured tasks which allows for easier code construction and analysis but limits the applicability for more general problems. For the first time, we consider the novel problem of multiplying many different matrices whose products may share matrices with no guaranteed regularity. Specifically, we consider the Variable Coded Distributed Batch Matrix Multiplication (VCDBMM) problem where the system is given two sets of matrices $\mathcal{A}=\{A_1,A_2,\dots,A_{\mathcal{A}}\}$ and $\mathcal{B}=\{B_1,B_2,\dots,B_{\mathcal{B}}\}$ and a set of computation goals $\mathcal{S} = \{(i_1,j_1),(i_2,j_2), \dots (i_{\mathcal{S}},j_{\mathcal{S}})\}$ and the objective is to calculate the matrix multiplication $A_iB_j$ for every $(i,j) \in \mathcal{S}$ in the presence of stragglers. Therefore, a good coding solution minimizes the recovery threshold (i.e., the number of workers that we need to wait for in order to compute the final output). Inspired by CrossSubspace Alignment Codes, we construct Flexible CrossSubspace Alignment Codes (FCSA) to solve the general VCDBMM problem. We provide two variants of FCSA codes that allow for a tradeoff between the encoding/decoding complexity and the recovery threshold. We demonstrate that both variants are within a factor of two optimal under certain system constraints. We also generalize FCSA codes into Grouped FCSA codes where we group computations together to provide further flexibility between the computational complexity at the workers and the recovery threshold. We provide simulations on random instances of the VCDBMM problem and demonstrate the average improvement offered by our codes.
Main Content
Enter the password to open this PDF file:













