Scholarly Works (2 results)

Article
Peer Reviewed

Performance characterization of scientific workflows for the optimal use of Burst Buffers

LBL Publications (2020)

Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are becoming available on HPC platforms. In this paper, we analyze the performance characteristics of a Burst Buffer and two representative scientific workflows with the aim of optimizing the usage of a Burst Buffer, extending our previous analyses (Daley et al., 2016). Our key contributions are (a) developing a performance analysis methodology pertinent to Burst Buffers, (b) improving the use of a Burst Buffer in workflows with bandwidth-sensitive and metadata-sensitive I/O workloads, and (c) highlighting the key data management challenges when incorporating a Burst Buffer in the studied scientific workflows.

Peer Reviewed

Community Access to MODIS Satellite Reprojection and Reduction Pipeline and Data Sets

LBL Publications (2012)

The Moderate Resolution Imaging Spectroradiometer (MODIS), the key instrument aboard NASA's Terra and Aqua satellites, continuously generates data as the satellites cover the entire surface of the Earth every one to two days. This data is important to many scientific analyses; however, data procurement and processing can be challenging and cumbersome for user communities. Our current work is focused on enabling calculations using a combination of land and atmosphere products over land. Before performing the calculation, the data must be downloaded and transformed from a swath space and time system to a sinusoidal tiling system. Downloading a single product for an entire year can take several days and involves downloading many small files via FTP (on average ~83,000 files) in Hierarchical Data Format (HDF4). The data processing, a swath-to-sinusoidal reprojection, is computationally intensive, and currently available community tools only work for single sinusoidal tiles. We have developed a data-processing pipeline that downloads the MODIS products and reprojects them on HPC systems. HPC systems do not traditionally run these high-throughput, data-intensive jobs, and hence we need to address unique challenges for our pipeline. The first stage in the pipeline uses a catalog to determine which files need to be downloaded and downloads the identified data sets. In the future, the downloaded files will trigger an event that causes the reprojection job to be entered into a job queue. The output data is stored in an archival system. The resulting reprojected data will soon be widely available to the community through a front-end web portal. The portal will allow users to download reprojected data (~1 TB/year) for the following land and atmosphere products: MOD04_L2 (Aerosol), MOD05_L2 (Water Vapor), MOD06_L2 (Cloud), MOD07_L2 (Atmosphere Profile) and MOD11_L2 (Land Surface Temperature and Emissivity).
In this talk we will describe the architecture of the overall system and pipeline. Our long-term plan is to allow users to reproject data on demand and/or run algorithms, such as an evapotranspiration calculation, on the reprojected MODIS data.
