In recent years, the increasing complexity of scientific simulations and the
emerging demand for training large artificial intelligence models have required
massive and fast data access, which pushes high-performance computing (HPC)
platforms to adopt more advanced storage infrastructure such as
solid-state disks (SSDs). While SSDs offer high-performance I/O, the
reliability challenges that HPC applications face under SSD-related
failures remain unclear, in particular for failures that result in data
corruption. The goal of this paper is to understand the impact of SSD-related
faults on the behavior of complex HPC applications. To this end, we propose
FFIS, a FUSE-based fault injection framework that systematically introduces
storage faults into the application layer to model errors originating from
SSDs. FFIS can plant different I/O-related faults into the data returned
from the underlying file system, which enables investigation of the error
resilience characteristics of scientific file formats. We demonstrate the
use of FFIS with three representative real-world HPC applications, showing how
each application reacts to data corruption, and provide insights into the error
resilience of the widely adopted HDF5 file format for HPC applications.
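
To make the idea of planting faults into data returned from the file system concrete, the following is a minimal Python sketch, not the FFIS implementation. It omits FUSE entirely and only shows the buffer transformation that a read handler could apply before returning data to the application. The function names (`flip_bit`, `zero_block`, `faulty_read`), the two fault models, and the 512-byte block size are all illustrative assumptions that do not come from the paper.

```python
import random

# Hypothetical fault models for illustration only; the names, the set of
# faults, and the default block size are assumptions, not taken from FFIS.

def flip_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of data with one randomly chosen bit inverted."""
    if not data:
        return data
    buf = bytearray(data)
    pos = rng.randrange(len(buf) * 8)      # bit index within the buffer
    buf[pos // 8] ^= 1 << (pos % 8)        # invert that single bit
    return bytes(buf)

def zero_block(data: bytes, rng: random.Random, block_size: int = 512) -> bytes:
    """Return a copy of data with one block-aligned region set to zeros."""
    if not data:
        return data
    buf = bytearray(data)
    n_blocks = (len(buf) + block_size - 1) // block_size
    start = rng.randrange(n_blocks) * block_size
    end = min(start + block_size, len(buf))
    buf[start:end] = bytes(end - start)    # overwrite the block with zeros
    return bytes(buf)

FAULT_MODELS = {"bit_flip": flip_bit, "zero_block": zero_block}

def faulty_read(path: str, offset: int, size: int,
                fault: str, rng: random.Random) -> bytes:
    """Read size bytes at offset from path, then corrupt them before returning.

    This stands in for what an interposed read handler could do: the
    underlying file is left untouched and only the returned data is altered.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    return FAULT_MODELS[fault](data, rng)
```

Seeding the generator, for example `random.Random(42)`, makes an injection campaign repeatable, so an application's reaction to a given corruption can be reproduced.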