FT-RT-TDDFT: Fault Tolerant Real-Time Time-Dependent Density Functional Theory on High Performance Computing Systems
- Suvarna, Vineeth Bhaskar
- Advisor(s): Chen, Zizhong
Abstract
HPC systems continue to grow exponentially in scale. Fault tolerance in these systems is becoming increasingly important for applications such as Real-Time Time-Dependent Density Functional Theory (RT-TDDFT) that run for extended periods. Checkpoint-restart is a common method of achieving fault tolerance in HPC systems. In this thesis, we analyze the performance of a single-file checkpoint-restart implementation in RT-TDDFT, in which data is collectively checkpointed to a single file, and find that storing the checkpoints in persistent storage adds significant performance overhead. We demonstrate multi-file checkpoint-restart in RT-TDDFT, which creates multiple checkpoint files to improve checkpointing performance. We further reduce the performance overhead using in-memory checkpoint-restart, in which checkpoints are stored in memory instead of persistent storage. We perform a comparative analysis and show that multi-file and in-memory checkpoint-restart achieve significant performance gains over single-file checkpoint-restart. In this way, we implement multi-file and in-memory checkpoint-restart for fault-tolerant RT-TDDFT on high performance computing systems.
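The three checkpointing schemes named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a toy setting where each "rank" is a list entry in a single Python process rather than an MPI process, and all function and class names (`checkpoint_single_file`, `checkpoint_multi_file`, `restart_multi_file`, `InMemoryCheckpoint`) and the file naming pattern are hypothetical. The in-memory variant here keeps the copy in the same process, which would not survive a node failure; how the thesis places in-memory copies is not stated in the abstract.

```python
import os
import pickle
import tempfile


def checkpoint_single_file(states, path):
    """Single-file scheme: all ranks' states go into one shared file."""
    with open(path, "wb") as f:
        pickle.dump(states, f)


def checkpoint_multi_file(states, directory):
    """Multi-file scheme: each rank writes its own checkpoint file."""
    for rank, state in enumerate(states):
        # Hypothetical naming pattern, one file per rank.
        path = os.path.join(directory, f"ckpt_rank{rank}.pkl")
        with open(path, "wb") as f:
            pickle.dump(state, f)


def restart_multi_file(directory, n_ranks):
    """Read back the per-rank files written by checkpoint_multi_file."""
    states = []
    for rank in range(n_ranks):
        path = os.path.join(directory, f"ckpt_rank{rank}.pkl")
        with open(path, "rb") as f:
            states.append(pickle.load(f))
    return states


class InMemoryCheckpoint:
    """In-memory scheme: keep serialized copies in RAM, not on disk."""

    def __init__(self):
        self._store = {}

    def save(self, rank, state):
        # Serializing makes an independent copy, so later changes to
        # the live state do not alter the checkpoint.
        self._store[rank] = pickle.dumps(state)

    def restore(self, rank):
        return pickle.loads(self._store[rank])


if __name__ == "__main__":
    # Toy per-rank state: a step counter and a small coefficient list.
    states = [{"step": 3, "psi": [0.1 * r, 0.2 * r]} for r in range(4)]
    with tempfile.TemporaryDirectory() as d:
        checkpoint_single_file(states, os.path.join(d, "ckpt_all.pkl"))
        checkpoint_multi_file(states, d)
        print(restart_multi_file(d, len(states)) == states)
```

The sketch only shows where the data goes in each scheme. It says nothing about the relative performance reported in the thesis, which depends on the parallel file system and the collective I/O involved.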