NERSC's Global File System (NGF), accessible from all compute systems at NERSC, holds files and data from many scientific projects. A full backup of this file system to our High Performance Storage System (HPSS) is performed periodically. Project disk usage at NERSC has grown sevenfold over a two-year period, from ~20 TB in June 2006 to ~140 TB in June 2008. The latest full backup took about 13 days and consumed more than 200 T10K tape cartridges (0.5 TB capacity each). Petabyte file systems will become a reality within the next few years, and existing utilities are already strained in handling backup tasks.
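As an illustrative check, using only the figures above, the sevenfold growth over two years implies an annual growth factor $g$, and extrapolating at that rate suggests when petabyte scale is reached:

\[
g = \left(\frac{140\ \mathrm{TB}}{20\ \mathrm{TB}}\right)^{1/2} = \sqrt{7} \approx 2.65
\]
\[
140\ \mathrm{TB} \times g^{2} \approx 980\ \mathrm{TB} \approx 1\ \mathrm{PB} \quad \text{(by mid-2010, assuming the trend continues)}
\]

This simple extrapolation is a sketch, not a forecast from the report, but it indicates why petabyte-scale file systems are expected within a few years.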
In order to address the needs of future scientific applications for storing and accessing large amounts of data efficiently, one needs to understand the limitations of current technologies and how they may cause system instability or unavailability. A number of factors can impact system availability, ranging from a facility-wide power outage to a single point of failure such as a network switch or a global file system. In addition, individual component failures in a system can degrade that system's performance. This paper focuses on analyzing both of these factors and their impacts on the computational and storage systems at NERSC. The component failure data presented in this report primarily covers disk drive failures in one of the computational systems and tape drive failures in HPSS. NERSC collected available component failure data and system-wide outage records for its computational and storage systems over a six-year period and made them available to the HPC community through the Petascale Data Storage Institute.