Designing Efficient and Resilient Lossy Compressors for Large-Scale Scientific Computing
Extreme-scale scientific simulations are essential in many domains, including cosmology, climate science, fluid dynamics, and chemistry, and running them at larger scale has been shown to enable more discoveries. As these applications scale up, two problems grow with them. First, saturated I/O bandwidth can significantly slow down a simulation because of the huge amount of data that must be dumped to the storage system. Second, soft errors striking a simulation are no longer negligible, given the large number of components in a supercomputer and the fact that a single run can take days to finish. It is therefore worthwhile to reduce the I/O time and harden the resilience of large-scale simulations. Hardware solutions, such as new storage systems or error-resilient computing devices, are highly general, but they usually require longer development time and much more effort than software solutions. This thesis seeks software solutions by designing efficient and resilient lossy compressors for large-scale scientific simulations.
To improve overall simulation performance, we propose a lossy compressor with a much higher compression ratio, focusing specifically on particle-based scientific simulations. A higher compression ratio means less data is written to the storage system, which in turn reduces I/O time. The state-of-the-art lossy compressor takes advantage of spatial smoothness to achieve high compression ratios. However, particle-based simulation data has very limited smoothness in space, which leads to inadequate compression ratios. We instead exploit smoothness in time and design an optimized compression model built on the existing lossy compressor. Results show that our optimized compression model achieves much better compression ratios and significantly reduces I/O time at large scale.
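The contrast between spatial and temporal smoothness can be illustrated with a minimal sketch. It is not the compression model developed in this thesis: the function names, the synthetic particle data, and the simple previous-value predictor are all assumptions made for illustration. Each value is predicted either from its reconstructed neighbor in index order (space) or from the same particle's reconstructed value in the previous snapshot (time), and the prediction residual is quantized so that the reconstruction error stays within the bound `eb`. Smaller quantization codes are generally cheaper to encode.

```python
import random

def compress_spatial(snapshot, eb):
    """Predict each value from the reconstructed previous value in index order."""
    codes, recon, prev = [], [], 0.0
    for v in snapshot:
        code = round((v - prev) / (2 * eb))  # quantize the prediction residual
        prev = prev + code * 2 * eb          # value a decompressor would rebuild
        codes.append(code)
        recon.append(prev)
    return codes, recon

def compress_temporal(snapshot, prev_recon, eb):
    """Predict each value from the same particle in the previous reconstructed snapshot."""
    codes, recon = [], []
    for v, p in zip(snapshot, prev_recon):
        code = round((v - p) / (2 * eb))
        codes.append(code)
        recon.append(p + code * 2 * eb)
    return codes, recon

def mean_abs(codes):
    """Average magnitude of the quantization codes."""
    return sum(abs(c) for c in codes) / len(codes)

# Synthetic data (assumption): particles stored in arbitrary order, so adjacent
# entries are unrelated in space, but each particle moves only slightly per step.
rng = random.Random(0)
eb = 0.01
step0 = [rng.uniform(0.0, 100.0) for _ in range(1000)]
step1 = [x + rng.uniform(-0.05, 0.05) for x in step0]

_, recon0 = compress_spatial(step0, eb)
spatial_codes, spatial_recon = compress_spatial(step1, eb)
temporal_codes, temporal_recon = compress_temporal(step1, recon0, eb)

print("mean |code|, spatial prediction: ", mean_abs(spatial_codes))
print("mean |code|, temporal prediction:", mean_abs(temporal_codes))
```

On this synthetic data the temporal predictor yields far smaller codes than the spatial one, while both reconstructions respect the same error bound. Real particle data and a real compressor will behave differently in detail.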
To improve the resilience of simulation applications equipped with lossy compression, we design soft-error-resilient schemes for lossy compressors. First, we provide an algorithm-scope protection for SZ, a widely used lossy compressor. Then, we provide an application-scope protection that can be applied to all error-bounded lossy compressors. The algorithm-scope protection covers only soft errors that occur during the execution of the lossy compression itself, whereas the application-scope protection covers soft errors during simulation, lossy compression, and even data writing. Both the algorithm-scope and application-scope protections provide significantly better resilience while keeping the performance overhead low.
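As a rough illustration of compressor-agnostic protection, the sketch below wraps an error-bounded compressor with two generic checks. It is not the scheme designed in this thesis, and it does not cover errors during the simulation itself. The toy compressor and every function name here are hypothetical stand-ins. The wrapper decompresses right after compression to verify the error bound, which can expose corruption during compression, and it attaches a CRC32 so that corruption during writing or storage can be detected on read.

```python
import struct
import zlib

def toy_compress(values, eb):
    """Hypothetical stand-in compressor: uniform quantization followed by zlib."""
    codes = [round(v / (2 * eb)) for v in values]
    return zlib.compress(struct.pack(f"<{len(codes)}q", *codes))

def toy_decompress(blob, eb):
    """Inverse of toy_compress: unpack the integer codes and rescale them."""
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * 2 * eb for c in codes]

def protected_compress(values, eb):
    """Compress, verify the error bound by decompressing, and attach a CRC32."""
    blob = toy_compress(values, eb)
    recon = toy_decompress(blob, eb)
    # Small relative slack absorbs floating-point rounding in the check itself.
    if any(abs(a - b) > eb * (1 + 1e-9) for a, b in zip(values, recon)):
        raise RuntimeError("error bound violated: possible soft error during compression")
    return blob, zlib.crc32(blob)

def verified_read(blob, checksum, eb):
    """Reject data whose checksum no longer matches, then decompress."""
    if zlib.crc32(blob) != checksum:
        raise RuntimeError("checksum mismatch: data corrupted after compression")
    return toy_decompress(blob, eb)

values = [0.1 * i for i in range(100)]
blob, checksum = protected_compress(values, 1e-3)
restored = verified_read(blob, checksum, 1e-3)

# Simulate a single bit flip in the stored bytes.
corrupted = bytearray(blob)
corrupted[5] ^= 0x01
try:
    verified_read(bytes(corrupted), checksum, 1e-3)
    detected = False
except RuntimeError:
    detected = True
print("bit flip detected:", detected)
```

Because the checks rely only on the error bound and on the compressed bytes, the same wrapper could in principle surround any error-bounded compressor, at the cost of one extra decompression per write.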