ARCHIE: Data Analysis Acceleration with Array Caching in Hierarchical Storage
Published Web Location
https://sdm.lbl.gov/pdc/pubs/201812-BigData-ARCHIE.pdf
Abstract
Scientific data analysis typically involves reading massive amounts of data generated by simulations, experiments, and observations. Reading such large volumes of data from disk-based file systems is often slow because of the mechanical components in the disks. Recent supercomputing systems are adding non-volatile storage layers in a hierarchy to bridge the performance gap between fast main memory and slow disk-based storage. Software libraries for managing this hierarchy must not only read data efficiently but also reduce user involvement in cross-layer data movement. Furthermore, because scientific data is often organized in array-based data structures, these libraries need to support array data access patterns in hierarchical storage management. Existing software typically manages individual storage layers separately, requiring significant manual effort to move data among them. In this paper, we introduce array caching in hierarchical storage (ARCHIE), a new approach that accelerates array data analysis in a seamless fashion. ARCHIE evaluates array access patterns and prefetches data with array semantics between storage layers. Our evaluation shows that, on a production supercomputing system, ARCHIE outperforms state-of-the-art file systems, namely Lustre and DataWarp, by up to 5.8× in data access by scientific analysis applications.