Tensors, which generalize matrices to more than two dimensions, are fundamental to many disciplines, including scientific computing and machine learning. Improving the performance and scalability of tensor computation is essential to those domains. Recent advances in heterogeneous memory promise to deliver large-scale, high-performance tensor computation. However, leveraging memory heterogeneity is challenging because of the performance disparity between memory components. Tensor computation, often characterized by irregular memory access patterns, large working-set sizes, and tensor dimension sizes unknown before runtime, makes the use of heterogeneous memory even more challenging.
In this dissertation, we propose efficient and scalable heterogeneous memory systems for tensor computation to address these challenges. The core innovation of our proposed systems is the introduction of system-architecture-tensor co-designs, which exploit intersectional domain knowledge spanning runtime system policies, architecture characteristics, and tensor features. In particular, our approach takes into account runtime system policies (e.g., policies for data migration, prefetching, and concurrency control), architecture characteristics (e.g., those of emerging non-volatile memories, 3D-stacked memories, and accelerators with massive parallelism), and tensor features (e.g., high data dimensionality, varying memory access patterns, and irregular data distribution within the data structure) for tensor computation.
The evaluation results show that: (1) evaluated on various sparse tensor contraction datasets, our design brings 28-576x speedup over the state-of-the-art sparse tensor contraction design; (2) evaluated on various sparse tensor contraction sequence datasets, our design brings 327-7362x speedup over the state-of-the-art work; (3) evaluated on various tensor-based neural network training workloads, our design achieves up to 24x and 4x lower energy consumption compared to CPUs and GPUs, respectively; (4) evaluated on various tensor-based neural network training workloads, our design achieves up to 50% (33% on average) performance improvement over the state-of-the-art TensorFlow.