Open Access Publications from the University of California

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

  • Author(s): Gittens, Alex
  • Devarakonda, Aditya
  • Racah, Evan
  • Ringenburg, Michael
  • Gerhardt, Lisa
  • Kottalam, Jey
  • Liu, Jialin
  • Maschhoff, Kristyn
  • Canon, Shane
  • Chhugani, Jatin
  • Sharma, Pramod
  • Yang, Jiyan
  • Demmel, James
  • Harrell, Jim
  • Krishnamurthy, Venkat
  • Mahoney, Michael W.
  • Prabhat, Mr
  • et al.

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling, and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
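To illustrate why a tall-and-skinny matrix fits a data-parallel model, the sketch below computes the small Gram matrix A^T A by summing independent per-row-block contributions, a map step followed by a reduce step. This is not the authors' implementation and does not use Spark or MPI; it is a plain-Python stand-in under the assumption that the matrix is partitioned by rows, and every function name (`partial_gram`, `add_matrices`, `split_rows`, `distributed_gram`) is hypothetical. The Gram matrix is chosen because its eigenvectors are the right singular vectors of A, which is one common route to PCA.

```python
from functools import reduce


def partial_gram(block):
    """Return the n x n Gram matrix (block^T block) of one row block."""
    n = len(block[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in block:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices (the reduce step)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_rows(matrix, num_blocks):
    """Partition the rows into contiguous blocks, as a row-partitioned store would."""
    size = -(-len(matrix) // num_blocks)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def distributed_gram(matrix, num_blocks):
    """Map each row block to its small Gram matrix, then sum the results."""
    blocks = split_rows(matrix, num_blocks)
    return reduce(add_matrices, map(partial_gram, blocks))


# Hypothetical toy input: 8 rows, 3 columns (many more rows than columns).
A = [[float(r + c) for c in range(3)] for r in range(8)]
G = distributed_gram(A, num_blocks=4)
```

Each block is processed without reference to any other, and only n x n partial results are combined, so the amount of data exchanged depends on the number of columns rather than the number of rows.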
