Chunking of Large Multidimensional Arrays

Abstract

Data-intensive scientific computations, as well as on-line analytical processing applications, are performed on very large datasets modeled as k-dimensional arrays. Such arrays are organized on disk by partitioning the large global array into fixed-size hyper-rectangular sub-arrays called chunks or tiles, which form the units of data transfer between disk and memory. Typical queries retrieve sub-arrays, and doing so requires accessing every chunk that overlaps the query result. An important metric of storage efficiency is therefore the expected number of chunks retrieved over all such queries. The question that immediately arises is: what chunk shapes minimize the expected number of chunks retrieved over a query workload? In this paper we develop two probabilistic mathematical models of the problem and provide exact solutions using steepest-descent and geometric-programming methods. Experimental results, using synthetic workloads on real-life datasets, show that our chunking is much more efficient than existing approximate solutions.
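The cost metric described above admits a compact illustration. The sketch below is not the authors' code: it assumes the standard model in which a query with side length q_i, placed uniformly at random over chunks with side length c_i, overlaps q_i/c_i + 1 chunks along dimension i in expectation, so the expected retrieval cost is the product of those terms. Where the paper derives exact optima via steepest descent and geometric programming, this sketch simply brute-forces candidate shapes of a fixed chunk volume; the function names and numeric values are hypothetical.

```python
# Illustrative cost model for chunk-shape selection (not the paper's method).
# Assumption: a query of extent q[i], uniformly placed, overlaps on average
# (q[i] / c[i] + 1) chunks along dimension i, so the expected number of
# chunks retrieved is the product over all dimensions.

import math
from itertools import product

def expected_chunks(q, c):
    """Expected number of chunks overlapped by a query of shape q
    when the array is chunked with side lengths c."""
    return math.prod(qi / ci + 1 for qi, ci in zip(q, c))

def best_shape(q, volume, candidates):
    """Brute-force search over candidate side lengths whose product
    equals the fixed chunk volume (e.g., a disk-block capacity)."""
    best = None
    for c in product(candidates, repeat=len(q)):
        if math.prod(c) != volume:
            continue  # only shapes matching the fixed chunk volume
        cost = expected_chunks(q, c)
        if best is None or cost < best[1]:
            best = (c, cost)
    return best

# Example: a 3-d array, typical queries of shape 100 x 20 x 5,
# chunk volume fixed at 4096 cells, power-of-two side lengths.
shape, cost = best_shape((100, 20, 5), 4096, [1, 2, 4, 8, 16, 32, 64, 128])
print(shape, round(cost, 2))
```

Note that a cubic chunk is generally not optimal here: the search skews the chunk's long side toward the query's long dimension, which is exactly the intuition the paper's models formalize and solve exactly over a distribution of query shapes rather than a single one.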
