eScholarship
Open Access Publications from the University of California

Optimizing Access to Scientific Data for Storage, Analysis and Visualization

  • Author(s): Ionkov, Latchesar
  • Advisor(s): Maltzahn, Carlos
  • et al.
Creative Commons Attribution-ShareAlike 4.0 International Public License
Abstract

Scientific workflows contain an increasing number of interacting

applications, often with a large disparity between the formats of data

being produced and consumed by different applications. This mismatch

can result in performance degradation as data retrieval causes

multiple read operations (often to a remote storage system) in order

to convert the data. In recent years, with the large increase in the

amount of data and computational power available, there is demand for

applications to support data access in situ, or close to the simulation, to

provide application steering, analytics and visualization.

Although some parallel filesystems and middleware

libraries attempt to identify access patterns and optimize data

retrieval, they frequently fail if the patterns are complex. It is

evident that more knowledge of the structure of the datasets at the

storage-system level will provide many opportunities for further

performance improvements.

For most developers of scientific applications, storing the

application data, and its particular format on disk, is not an

essential part of the application. Although they acknowledge the

importance of the I/O performance, their expertise lies mostly in

numerical simulations and the particular models their application

simulates. Most of their effort is spent on ensuring that

it produces correct numerical results. Ideally, they would like to be

able to have a library call that reads a subset of the data from storage (no

matter what its format is) and places it in the data structures the

simulation defines in the computer memory. However, since the data needs to be

analyzed and visualized, and the data has to be accessible from

third-party tools, the scientists are forced to know more about the

data formats.

In this dissertation we investigate multiple techniques for utilizing

dataset descriptions to improve performance and overall data

availability for HPC applications. We introduce a declarative data

description language that can be used to define the complete dataset

as well as parts of it. These descriptions are used to generate

transformation rules that allow data to be converted between different

physical layouts on storage and in memory.
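
As a rough illustration of converting between physical layouts from a single description, the sketch below uses a small field list to repack records from a row-oriented layout into a column-oriented one. It is not DRepl syntax: the `FIELDS` list, the function names, and the choice of little-endian packing are assumptions made only for this example.

```python
import struct

# Hypothetical dataset description: (field name, struct format code).
# This stands in for a declarative description; it is not DRepl syntax.
FIELDS = [("x", "d"), ("y", "d"), ("id", "i")]

# Little-endian, no padding: one record is 8 + 8 + 4 = 20 bytes.
ROW_FORMAT = "<" + "".join(code for _, code in FIELDS)


def pack_rows(records):
    """Row-oriented layout: the fields of each record are stored together."""
    return b"".join(
        struct.pack(ROW_FORMAT, *(rec[name] for name, _ in FIELDS))
        for rec in records
    )


def rows_to_columns(buf, count):
    """Repack row-oriented bytes so that each field is stored contiguously."""
    size = struct.calcsize(ROW_FORMAT)
    rows = [struct.unpack_from(ROW_FORMAT, buf, i * size) for i in range(count)]
    out = b""
    for col, (_, code) in enumerate(FIELDS):
        # One contiguous run of `count` values per field.
        out += struct.pack("<%d%s" % (count, code), *(row[col] for row in rows))
    return out


records = [{"x": 1.0, "y": 2.0, "id": 7}, {"x": 3.0, "y": 4.0, "id": 8}]
row_bytes = pack_rows(records)
column_bytes = rows_to_columns(row_bytes, len(records))
```

Both functions are driven by the same field list, so changing the description changes both layouts consistently; that is the only property this sketch is meant to show.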

First, we define the DRepl dataset description language and use it to

implement divergent data views and replicas as POSIX files. We

evaluate the performance of this approach and demonstrate its

advantages, both in transparent use by applications and in

combined performance when the application is paired with analytics

and/or visualization code that reads the data in a different format.

DRepl decouples the data producers and consumers and the data layouts

they use from the way the data is stored on the storage system.

DRepl has shown up to a 2x improvement in cumulative performance when data is

accessed using optimized replicas.

Second, we extend the previous approach to the parallel environment

used in HPC. Instead of using POSIX files, the new method allows data

to be accessed in larger chunks (fragments) in the way it will be laid

out in memory. The developers can define what data structures they

have in the process' memory and the overall format of the dataset on

storage, and the runtime will automatically take care of transforming

the data between the two. Both the formats in memory and on disk are

described with the DRepl language. Replacing the ability to read

the data as an array of bytes with operations that use descriptions of

the data structure provides better opportunities for the

storage system to optimize access to the persistent data. The

integration of this technique in Ceph demonstrates the potential

advantages of this approach. The experiments show performance

improvements of up to 5 times for writes and 10 times for reads, compared

to collective MPI I/O.
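
To make the contrast with byte-oriented access concrete, the following minimal sketch requests a fragment by its position and shape within a two-dimensional dataset rather than by byte offsets. The function name `read_fragment`, its parameters, and the row-major flat list standing in for storage are all assumptions for illustration; they do not reflect the DRepl runtime or the Ceph integration.

```python
def read_fragment(storage, dims, origin, shape):
    """Return a 2-D sub-block of a row-major dataset as a contiguous list.

    storage -- flat list of elements in row-major order (stand-in for storage)
    dims    -- (rows, cols) of the full dataset
    origin  -- (row, col) of the fragment's first element
    shape   -- (rows, cols) of the fragment
    """
    rows, cols = dims
    r0, c0 = origin
    nr, nc = shape
    if r0 < 0 or c0 < 0 or r0 + nr > rows or c0 + nc > cols:
        raise ValueError("fragment lies outside the dataset")
    out = []
    for r in range(r0, r0 + nr):
        # Each fragment row is one contiguous run in the stored dataset.
        start = r * cols + c0
        out.extend(storage[start:start + nc])
    return out


dataset = list(range(12))  # a 3 x 4 dataset holding 0..11
fragment = read_fragment(dataset, (3, 4), (1, 1), (2, 2))
```

Because the request carries the fragment's shape, the layer that serves it knows which runs belong together and is free to reorder or batch them, which a sequence of unrelated byte-range reads does not reveal.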

Third, we explore future directions for extending the DRepl

language to support more complex datasets. The additions would allow

scientists to use different resolutions for different parts of a

multi-dimensional space, and define how to transform the data between

resolutions. The changes would also allow completely abstract

definitions of datasets not only for continuums, but also for

primitive types like real and integer numbers. The fragments of the

dataset that are present in memory or on disk will have concrete

types that are compatible with the abstract types used in the dataset.

Finally, we lay the foundations for extending the previous

functionality to the most complicated data structures used in

scientific applications -- unstructured meshes.
