eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations

FT-RT-TDDFT: Fault Tolerant Real-Time Time-Dependent Density Functional Theory on High Performance Computing Systems

Abstract

High performance computing (HPC) systems continue to grow exponentially in scale. Fault tolerance in these systems is becoming increasingly important for applications such as Real-Time Time-Dependent Density Functional Theory (RT-TDDFT) that run for extended periods. Checkpoint-restart is a common method for achieving fault tolerance in HPC systems. In this thesis, we analyze the performance of a single-file checkpoint-restart implementation in RT-TDDFT, in which data is collectively checkpointed to a single file, and find that storing the checkpoints in persistent storage adds significant performance overhead. We demonstrate multi-file checkpoint-restart in RT-TDDFT, which creates multiple checkpoint files to improve checkpointing performance. We further reduce the overhead with in-memory checkpoint-restart, in which checkpoints are stored in memory instead of persistent storage. A comparative analysis shows that multi-file and in-memory checkpoint-restart achieve significant performance gains over single-file checkpoint-restart. Together, these implementations provide fault-tolerant RT-TDDFT on high performance computing systems.
