Lossy Compression for Exascale Scientific Applications
Today's scientific simulations produce vast volumes of data that cannot be stored or transferred efficiently because of limited memory capacity, storage capacity, and network bandwidth. The situation is worsening because of the ever-widening gap between relatively slow data transfer speeds and the fast-growing computational power of modern supercomputers. Error-bounded lossy compression is becoming one of the most critical techniques for addressing this problem: it can significantly reduce scientific data volume while guaranteeing that the reconstructed data remains valid for users, because the compression error is strictly bounded.
This thesis proposes three new lossy compressors for scientific applications across different domains. The first compressor uses second-order regression and second-order Lorenzo predictors to improve the prediction accuracy of SZ2, one of the best existing lossy compressors. It also includes an efficient approach for selecting the best-fit parameter setting, which combines a comprehensive a priori analysis of compression quality with an efficient online control mechanism.
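To make the prediction-based, error-bounded idea concrete, the following is a minimal 1D sketch and not the SZ2 implementation. It assumes a first-order Lorenzo predictor that reuses the previous reconstructed value, a second-order variant that extrapolates linearly from the previous two, and a simple linear quantizer with bin width twice the error bound. The function names (`predict`, `lorenzo_compress_1d`, `lorenzo_decompress_1d`) are hypothetical, and the entropy-coding stage a real compressor would apply to the integer codes is omitted.

```python
import numpy as np

def predict(recon, i, order):
    """Lorenzo-style 1D prediction from already reconstructed values (illustrative)."""
    if order == 2 and i >= 2:
        # Second-order: linear extrapolation from the two previous points.
        return 2.0 * recon[i - 1] - recon[i - 2]
    if i >= 1:
        # First-order: reuse the previous reconstructed value.
        return recon[i - 1]
    return 0.0  # No history yet; predict zero (an assumption of this sketch).

def lorenzo_compress_1d(data, error_bound, order=1):
    """Quantize prediction residuals so each point is reconstructed within error_bound."""
    recon = np.zeros(len(data))
    codes = np.zeros(len(data), dtype=np.int64)
    for i, value in enumerate(data):
        pred = predict(recon, i, order)
        # Bins of width 2*error_bound keep the rounding error at or below error_bound.
        codes[i] = int(np.round((value - pred) / (2.0 * error_bound)))
        # Predict from reconstructed data so the decompressor can mirror this step.
        recon[i] = pred + 2.0 * error_bound * codes[i]
    return codes, recon

def lorenzo_decompress_1d(codes, error_bound, order=1):
    """Rebuild the data by replaying the same predictions and adding dequantized residuals."""
    recon = np.zeros(len(codes))
    for i, code in enumerate(codes):
        recon[i] = predict(recon, i, order) + 2.0 * error_bound * code
    return recon

if __name__ == "__main__":
    signal = np.sin(np.linspace(0.0, 10.0, 200))
    for order in (1, 2):
        codes, recon = lorenzo_compress_1d(signal, 1e-3, order)
        print(order, np.max(np.abs(signal - recon)), np.mean(np.abs(codes)))
```

Because the predictor only ever sees reconstructed values, the decompressor reproduces the compressor's predictions exactly, which is what allows the pointwise error bound to hold.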
The second compressor uses a dynamic spline interpolation approach with a series of optimization strategies to further improve the compression quality. On the one hand, cubic spline interpolation is used to capture high-order data variation, yielding much higher prediction accuracy than linear regression on datasets with high-order variation. On the other hand, we derive constant coefficients for our interpolation approach, so that the coefficient storage overhead is completely eliminated. We further propose a dynamic optimization strategy that selects the better predictor between the interpolation approach and the multilevel Lorenzo predictor to improve the overall compression quality.
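The benefit of fixed interpolation coefficients can be illustrated with a small sketch. It assumes equally spaced samples and uses the midpoint weights (-1/16, 9/16, 9/16, -1/16) that result from fitting a cubic through four equally spaced neighbors. Whether these match the coefficients derived in the thesis is an assumption, and the helper names are hypothetical. Since the weights do not depend on the data, nothing has to be stored per block, unlike regression coefficients.

```python
import numpy as np

def interp_linear(b, c):
    """Midpoint prediction from the two nearest known neighbors."""
    return 0.5 * (b + c)

def interp_cubic(a, b, c, d):
    """Midpoint prediction from four equally spaced known neighbors (fixed weights)."""
    return (-a + 9.0 * b + 9.0 * c - d) / 16.0

def predict_odd_points(samples, kind="cubic"):
    """Predict every odd-indexed sample from the even-indexed (known) samples."""
    known = samples[0::2]
    preds = []
    for k in range(len(known) - 1):
        has_four_neighbors = 1 <= k < len(known) - 2
        if kind == "cubic" and has_four_neighbors:
            preds.append(interp_cubic(known[k - 1], known[k], known[k + 1], known[k + 2]))
        else:
            # Fall back to linear interpolation near the boundaries.
            preds.append(interp_linear(known[k], known[k + 1]))
    return np.array(preds)

if __name__ == "__main__":
    y = np.sin(np.linspace(0.0, 6.0, 65))
    actual = y[1::2]
    for kind in ("linear", "cubic"):
        err = np.mean(np.abs(predict_odd_points(y, kind) - actual))
        print(kind, err)
```

In an error-bounded compressor, the residuals between such predictions and the true values would then be quantized and encoded. A smaller residual generally means fewer bits, which is why the higher-order predictor can help on smoothly varying data.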
The third compressor specifically targets molecular dynamics (MD) simulation data. MD simulations can produce large volumes of data because they can involve trillions of atoms over hundreds of millions of snapshots. Traditional lossy compressors are not optimized for MD applications because of the trajectory nature and irregular shape of MD data. We propose the MDZ compressor, which contains three methods that fully leverage the data characteristics in both the spatial and temporal domains. An adaptive solution automatically selects the best-fit method at runtime.