Chunking of Large Multidimensional Arrays
Data-intensive scientific computations, as well as on-line analytical processing applications, operate on very large datasets that are modeled as k-dimensional arrays. Such arrays are stored on disk by partitioning the large global array into fixed-size hyper-rectangular sub-arrays, called chunks or tiles, that form the units of data transfer between disk and memory. Typical queries retrieve sub-arrays and must access every chunk that overlaps the query region. An important metric of storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is: "what shapes of array chunks give the minimum expected number of chunks retrieved over a query workload?" In this paper we develop two probabilistic mathematical models of the problem and provide exact solutions using steepest-descent and geometric-programming methods. Experimental results, using synthetic workloads on real-life datasets, show that our chunking is much more efficient than existing approximate solutions.
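To make the cost metric concrete, the following sketch counts the chunks a hyper-rectangular range query must touch under a given chunk shape. The function name and interface are illustrative, not from the paper; it simply applies the observation that, per dimension, the query range intersects a contiguous run of chunk intervals, and the total is the product of the per-dimension counts.

```python
def chunks_touched(query_lo, query_hi, chunk_shape):
    """Number of chunks overlapped by a hyper-rectangular range query.

    query_lo / query_hi: inclusive per-dimension index bounds of the query.
    chunk_shape: per-dimension chunk side lengths.
    The count is the product, over dimensions, of how many chunk
    intervals the range [lo, hi] intersects in that dimension.
    """
    n = 1
    for lo, hi, c in zip(query_lo, query_hi, chunk_shape):
        first = lo // c  # index of first chunk intersected
        last = hi // c   # index of last chunk intersected
        n *= last - first + 1
    return n

# A 100x100 array stored in 10x10 chunks: a 15x15 query anchored at (5, 5)
# spans chunk rows 0..1 and chunk columns 0..1, so it touches 4 chunks.
print(chunks_touched((5, 5), (19, 19), (10, 10)))  # -> 4
```

Averaging this count over a distribution of query positions and shapes yields the expected-chunks objective that the chunk-shape optimization minimizes.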