Recent studies have identified disk storage systems as one of the major consumers of power in data centers. Many disk power management (DPM) schemes have been proposed in which the power consumed by disks is reduced by spinning them down during long idle periods. Spinning disks down and up incurs additional energy and response time costs. For that reason, DPM schemes are effective only if the disks experience relatively long idle periods and the scheme does not introduce a severe response time penalty. In this paper we introduce a dynamic block exchange algorithm that swaps data between disks based on the observed workload, so that frequently accessed blocks end up residing on a few "hot" disks, allowing the majority of disks to experience longer idle periods. We validate the effectiveness of the algorithm with trace-driven simulations showing power savings of up to 60% with very small response time penalties.
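To make the block-exchange idea concrete, the following sketch (in Python, with hypothetical block and disk identifiers) ranks blocks by observed access frequency and pairs hot blocks on spin-down candidate disks with cold blocks on the designated hot disks. It is only an illustration of the concept; the paper's algorithm, which works from observed workloads and handles the practical details, is not reproduced here.

```python
# A minimal sketch, not the paper's algorithm: concentrate frequently accessed
# blocks on a few designated "hot" disks by swapping them with rarely accessed
# blocks that currently occupy those disks. Capacity checks, migration cost,
# and the trigger for re-running the exchange are omitted.

def plan_swaps(access_counts, placement, hot_disks):
    """access_counts: block -> observed access count
    placement: block -> disk currently holding it
    hot_disks: set of disks meant to stay spinning
    Returns a list of (hot_block, cold_block) pairs whose locations to exchange."""
    on_cold = sorted((b for b in placement if placement[b] not in hot_disks),
                     key=lambda b: access_counts.get(b, 0), reverse=True)
    on_hot = sorted((b for b in placement if placement[b] in hot_disks),
                    key=lambda b: access_counts.get(b, 0))
    swaps = []
    for hot_block, cold_block in zip(on_cold, on_hot):
        if access_counts.get(hot_block, 0) > access_counts.get(cold_block, 0):
            swaps.append((hot_block, cold_block))
    return swaps

# Hypothetical example: four blocks over three disks, d0 designated hot.
counts = {"b1": 90, "b2": 2, "b3": 40, "b4": 1}
layout = {"b1": "d1", "b2": "d0", "b3": "d2", "b4": "d0"}
print(plan_swaps(counts, layout, hot_disks={"d0"}))  # -> [('b1', 'b4'), ('b3', 'b2')]
```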
Scientific data centers comprised of high-powered computing equipment and large-capacity disk storage systems consume a considerable amount of energy. Dynamic power management (DPM) techniques are commonly used for saving energy in disk systems. These involve powering down disks that exhibit long idle periods and placing them in standby mode. A file request to a disk in standby mode incurs both energy and performance penalties, as it takes energy (and time) to spin up the disk before it can serve the file. For this reason, DPM has to decide when to transition a disk into standby mode so that the energy saved is greater than the energy needed to spin it up again and the performance penalty is tolerable. The length of the idle period after which the DPM decides to power down a disk is called the idleness threshold. In this paper, we study, both analytically and experimentally, dynamic power management techniques that save energy subject to performance constraints on file access costs. Based on observed workloads of scientific applications and disk characteristics, we provide a methodology for determining file assignment to disks and computing idleness thresholds that significantly improve on the energy saved by existing DPM solutions while meeting response time constraints. We validate our methods with simulations that use traces taken from scientific applications.
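For intuition about idleness thresholds, the sketch below computes the classic break-even threshold, the idle time beyond which standby saves energy, from assumed disk power parameters. The numbers are made up for illustration; the thresholds derived in the paper are additionally constrained by response time requirements.

```python
# A minimal sketch of the break-even idleness threshold: spin a disk down only
# if the expected standby period saves more energy than the spin-down/spin-up
# transitions cost. Disk parameters below are hypothetical.

def break_even_threshold(p_idle, p_standby, e_down, e_up):
    """Idle time (seconds) beyond which standby saves energy."""
    # Energy saved per second in standby must amortize the transition energy.
    return (e_down + e_up) / (p_idle - p_standby)

# Hypothetical disk: 8 W idle, 1 W standby, 5 J to spin down, 25 J to spin up.
t = break_even_threshold(p_idle=8.0, p_standby=1.0, e_down=5.0, e_up=25.0)
print(f"Spin down after {t:.1f} s of idleness")  # ~4.3 s for these numbers
```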
Exponential data growth is a reality for most enterprise and scientific data centers. Improvements in the price/performance and storage densities of disks have made it both easy and affordable to maintain most of the data in large disk storage farms. The provisioning of disk storage farms, however, comes at the expense of high energy consumption due to the large number of spinning disks. The power for spinning the disks, together with the associated cooling costs, accounts for a significant fraction of the total power consumption of a typical data center. Given the trend of rising global fuel and energy prices and the high rate of data growth, the challenge is to implement configurations of large-scale disk storage systems that meet performance requirements for information retrieval across data centers. We present part of the solution to this challenge with an energy-efficient file allocation strategy on a large-scale disk storage system. Given the performance characteristics of the disks and a profile of the workload in terms of the frequencies of file requests and their sizes, the basic idea is to allocate files to disks such that the disks can be configured into two pools: active (constantly spinning) disks and passive (capable of being spun up or down) disks. The goal is to minimize the number of active disks subject to I/O performance constraints. We present an algorithm for solving this problem with guaranteed bounds from the optimal solution. Our algorithm runs in O(n) time, where n is the number of files allocated. It uses a mapping of our file allocation problem to a generalization of the bin packing problem known as 2-dimensional vector packing. Detailed simulation results are also provided.
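As a rough illustration of the vector-packing view, the sketch below treats each file as a pair of normalized demands (storage capacity and I/O load) and packs files first-fit onto unit-capacity disks. The first-fit heuristic and the example workload are assumptions for illustration; the paper's algorithm is more refined and comes with guaranteed bounds from the optimal solution.

```python
# A minimal sketch of allocation as 2-dimensional vector packing: each file has
# two normalized demands (capacity, I/O load) and each disk is a bin with unit
# capacity in both dimensions. First-fit is a simple stand-in heuristic.

def vector_pack(files):
    """files: list of (size_frac, load_frac) pairs, each in [0, 1].
    Returns a list of disks, each a list of file indices."""
    disks = []     # per-disk list of assigned file indices
    residual = []  # per-disk remaining (size, load) capacity
    for i, (size, load) in enumerate(files):
        for d, (rs, rl) in enumerate(residual):
            if size <= rs and load <= rl:          # fits in both dimensions
                disks[d].append(i)
                residual[d] = (rs - size, rl - load)
                break
        else:                                      # open a new (active) disk
            disks.append([i])
            residual.append((1.0 - size, 1.0 - load))
    return disks

# Hypothetical workload: (fraction of disk capacity, fraction of disk bandwidth).
print(vector_pack([(0.5, 0.2), (0.4, 0.7), (0.3, 0.3), (0.6, 0.1)]))  # -> [[0, 1], [2, 3]]
```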
It is anticipated that in the near future disk storage systems will surpass application servers as the primary consumer of power in data centers. Shutting down inactive disks is one of the most widespread approaches to reducing the power consumption of disk systems. This approach involves spinning down or completely shutting off disks that exhibit long periods of inactivity and placing them in standby mode. A file request to a disk in standby mode incurs an I/O cost penalty, as it takes time to spin up the disk before it can serve the file. In this paper, we address the problem of designing and implementing file allocation strategies on disk storage that save energy while meeting performance requirements for file retrievals. We present an algorithm for solving this problem with guaranteed bounds from the optimal solution. Our algorithm runs in O(n log n) time, where n is the number of files allocated. Detailed simulation results and experiments with real-life workloads are also presented.
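For illustration only, the sketch below shows one sort-based heuristic consistent with an O(n log n) bound: files are ordered by access rate and the hottest ones are placed on an always-on pool up to assumed capacity and utilization budgets, while the rest go to disks that may be spun down. The budgets and the single aggregate active pool are simplifications, not the paper's algorithm.

```python
# A minimal sketch of a heat-based split into active and passive storage.
# The sort over n files dominates the running time, giving O(n log n).
# disk_capacity and util_budget are hypothetical aggregate limits.

def allocate(files, disk_capacity, util_budget):
    """files: list of (name, size, access_rate). Returns (active, passive) lists."""
    active, passive = [], []
    cap_left, util_left = disk_capacity, util_budget
    for name, size, rate in sorted(files, key=lambda f: f[2], reverse=True):
        if size <= cap_left and rate <= util_left:
            active.append(name)           # hot file stays on always-on disks
            cap_left -= size
            util_left -= rate
        else:
            passive.append(name)          # cold file may be spun down with its disk
    return active, passive

files = [("a", 10, 200.0), ("b", 40, 5.0), ("c", 20, 120.0), ("d", 30, 0.5)]
print(allocate(files, disk_capacity=50, util_budget=400.0))  # -> (['a', 'c'], ['b', 'd'])
```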
Datasets used in scientific and engineering applications are often modeled as dense multi-dimensional arrays. For very large datasets, the corresponding array models are typically stored out-of-core as array files. The array elements are mapped onto linear consecutive locations that correspond to the linear ordering of the multi-dimensional indices. Two conventional mappings are the row-major order and the column-major order of multi-dimensional arrays. Such conventional mappings of dense array files severely limit the performance of applications and the extendibility of the dataset. First, an array file organized in, say, row-major order gives abysmal performance to applications that subsequently access the data in column-major order. Second, any subsequent expansion of the array file is limited to only one dimension; expansions of such out-of-core conventional arrays along arbitrary dimensions require storage reorganization that can be very expensive. We present a solution for storing out-of-core dense extendible arrays that resolves these two limitations. The method uses a mapping function F*(), together with information maintained in axial vectors, to compute the linear address of an extendible array element when passed its k-dimensional index. We also give the inverse function, F*^-1(), for deriving the k-dimensional index when given the linear address. We show how the mapping function, in combination with MPI-IO and a parallel file system, allows the extendible array to grow without reorganization and with no significant performance degradation for applications accessing elements in any desired order. We give methods for reading and writing sub-arrays into and out of parallel applications that run on a cluster of workstations. The axial vectors are replicated and maintained in each node that accesses sub-array elements.
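The following simplified 2-D sketch illustrates the general idea behind addressing with axial vectors: each expansion appends one contiguous segment to storage and records just enough information to compute a linear address later, so existing elements never move. The class, its record layout, and the tie-breaking rule are illustrative assumptions; the paper's k-dimensional F*() and its inverse, and their integration with MPI-IO, are not reproduced here.

```python
# A simplified, hypothetical 2-D illustration of extendible-array addressing
# with axial vectors. Each expansion along a dimension appends one contiguous
# segment and records (first index covered, base address, stride of the other
# dimension) in that dimension's axial vector.

import bisect

class ExtendibleArray2D:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.storage = [None] * (rows * cols)
        # One axial vector per dimension. The initial row-major block is owned
        # by dimension 0; dimension 1 gets a sentinel record (base -1) so the
        # initial region always resolves to dimension 0.
        self.axial = [[(0, 0, cols)], [(0, -1, rows)]]

    def extend_rows(self, n):
        self.axial[0].append((self.rows, len(self.storage), self.cols))
        self.storage += [None] * (n * self.cols)
        self.rows += n

    def extend_cols(self, n):
        self.axial[1].append((self.cols, len(self.storage), self.rows))
        self.storage += [None] * (n * self.rows)
        self.cols += n

    def _address(self, i, j):
        # Latest expansion record covering each index along its dimension.
        r0, b0, s0 = self.axial[0][bisect.bisect_right([r[0] for r in self.axial[0]], i) - 1]
        c0, b1, s1 = self.axial[1][bisect.bisect_right([r[0] for r in self.axial[1]], j) - 1]
        # The element lives in the more recently allocated of the two segments.
        if b0 >= b1:
            return b0 + (i - r0) * s0 + j
        return b1 + (j - c0) * s1 + i

    def __setitem__(self, idx, value):
        self.storage[self._address(*idx)] = value

    def __getitem__(self, idx):
        return self.storage[self._address(*idx)]

a = ExtendibleArray2D(2, 3)
a[1, 2] = "x"
a.extend_cols(2)   # grow to 2 x 5 without moving existing elements
a.extend_rows(1)   # grow to 3 x 5
a[2, 4] = "y"
print(a[1, 2], a[2, 4])   # -> x y
```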
Very large multidimensional arrays are commonly used in data-intensive scientific computations as well as in on-line analytical processing applications, referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed-size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that accesses all chunks that overlap the query results. An important metric of storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give the minimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker, who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real-life workloads, show that our solutions are consistently within 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of Sarawagi and Stonebraker can deviate considerably from the true result as the number of dimensions increases and may also lead to suboptimal chunk shapes.
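As a concrete illustration of the optimization target, the sketch below uses one common probabilistic model, an expected overlap of 1 + (q_i - 1)/c_i chunks per dimension for a query of extent q_i against chunks of extent c_i under uniformly random placement, and finds a good 2-D chunk shape by exhaustive search. The workload and chunk-size budget are hypothetical, and the brute-force search merely stands in for the paper's steepest descent and geometric programming solutions.

```python
# A minimal sketch of chunk-shape selection under one common expected-overlap
# model. The exhaustive 2-D search is for illustration only.

from math import prod

def expected_chunks(query, chunk):
    # Expected chunks overlapped by a randomly placed query of the given shape.
    return prod(1 + (q - 1) / c for q, c in zip(query, chunk))

def best_chunk_shape(workload, chunk_elems, max_extent=512):
    """workload: list of (weight, query_shape). Exhaustive 2-D search over
    chunk shapes whose element count equals chunk_elems."""
    best = None
    for c1 in range(1, max_extent + 1):
        c2 = chunk_elems // c1
        if c2 < 1 or c1 * c2 != chunk_elems:
            continue
        cost = sum(w * expected_chunks(q, (c1, c2)) for w, q in workload)
        if best is None or cost < best[0]:
            best = (cost, (c1, c2))
    return best

# Hypothetical workload: 70% row-oriented and 30% column-oriented queries,
# with a chunk-size budget of 4096 elements.
print(best_chunk_shape([(0.7, (4, 1000)), (0.3, (600, 8))], chunk_elems=4096))
```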
Caching techniques have been used to bridge the performance gap between levels of the storage hierarchy in computing systems. In data-intensive applications that access large data files over wide-area network environments, such as a data grid, caching mechanisms can significantly improve data access performance under appropriate workloads. In a data grid, it is envisioned that local disk storage resources retain or cache the data files being used by local applications. Under a workload of shared access and high locality of reference, the performance of the caching techniques depends heavily on the replacement policies being used. A replacement policy determines which set of objects must be evicted when space is needed. Unlike cache replacement policies in virtual memory paging or database buffering, developing an optimal replacement policy for data grids is complicated by the fact that the file objects being cached have varying sizes and have transfer and processing costs that vary with time. We present an accurate model for evaluating various replacement policies and propose a new replacement algorithm referred to as "Least Cost Beneficial based on K backward references (LCB-K)." Using this modeling technique, we compare LCB-K with various replacement policies such as Least Frequently Used (LFU), Least Recently Used (LRU), Greedy Dual-Size (GDS), etc., using synthetic and actual workloads of accesses to and from tertiary storage systems. The results obtained show that LCB-K and GDS are the most cost-effective cache replacement policies for storage resource management in data grids.
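To illustrate the kind of cost/benefit reasoning involved, the sketch below implements a hypothetical cost-aware eviction policy that keeps the K most recent reference times per file and evicts the file with the lowest estimated benefit per byte. It is written in the spirit of LCB-K but the benefit metric shown here is an assumption, not the paper's exact formula.

```python
# A minimal, hypothetical sketch of cost-aware cache replacement: benefit
# combines a file's retrieval cost with a reference-rate estimate derived from
# its K-th most recent access (as in LRU-K), normalized by file size.

import time
from collections import deque

class CostAwareCache:
    def __init__(self, capacity_bytes, k=2):
        self.capacity, self.k = capacity_bytes, k
        self.files = {}   # name -> (size, retrieval_cost, deque of access times)

    def _benefit(self, name, now):
        size, cost, refs = self.files[name]
        # Files with fewer than K references get rate 0 and are evicted first.
        span = now - refs[0] if len(refs) == self.k else float("inf")
        rate = len(refs) / span if span > 0 else 0.0
        return cost * rate / size          # estimated benefit per byte retained

    def access(self, name, size, retrieval_cost):
        now = time.monotonic()
        if name in self.files:
            self.files[name][2].append(now)
            return "hit"
        if size > self.capacity:           # object can never fit; bypass cache
            return "bypass"
        used = sum(s for s, _, _ in self.files.values())
        while used + size > self.capacity:
            victim = min(self.files, key=lambda f: self._benefit(f, now))
            used -= self.files.pop(victim)[0]
        self.files[name] = (size, retrieval_cost, deque([now], maxlen=self.k))
        return "miss"
```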
Data-intensive scientific computations, as well as on-line analytical processing applications, are done on very large datasets that are modeled as k-dimensional arrays. The storage organization of such arrays on disks is done by partitioning the large global array into fixed-size hyper-rectangular sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that accesses all chunks that overlap the query results. An important metric of storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give the minimum expected number of chunks over a query workload?" In this paper we develop two probabilistic mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic workloads on real-life data sets, show that our chunking is much more efficient than the existing approximate solutions.
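For a concrete sense of the quantity being modeled, the sketch below counts exactly how many chunks a single query touches at a given offset; averaging this count over uniformly random offsets yields the expected-overlap term 1 + (q_i - 1)/c_i per dimension commonly used in such models. The shapes and offsets are hypothetical.

```python
# A small sketch relating the probabilistic models to an exact per-query count:
# a query starting at offset s_i with extent q_i over chunks of extent c_i
# touches floor((s_i + q_i - 1)/c_i) - floor(s_i/c_i) + 1 chunks along
# dimension i, and the total is the product over dimensions.

from math import prod

def chunks_touched(start, query, chunk):
    return prod((s + q - 1) // c - s // c + 1
                for s, q, c in zip(start, query, chunk))

# A 100 x 80 query placed at offset (10, 5) on a grid of 64 x 64 chunks:
print(chunks_touched(start=(10, 5), query=(100, 80), chunk=(64, 64)))  # -> 4
```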