File caching in data intensive scientific applications
We present theoretical and experimental results for an important caching problem that arises frequently in data-intensive scientific applications. In such applications, jobs need to process several files simultaneously; that is, a job can be serviced only if all the files it needs are present in the disk cache. The set of files requested by a job is called a file-bundle. This requirement introduces the need for cache replacement algorithms based on file-bundles rather than individual files. We show that traditional caching algorithms such as Least Recently Used (LRU) and GreedyDual-Size (GDS) are not optimal in this case, since they are not sensitive to file-bundles and may hold irrelevant combinations of files in the cache. In this paper we propose and analyze a new cache replacement algorithm specifically adapted to deal with file-bundles. We tested the new algorithm using a disk cache simulation model under a wide range of parameters, such as file request distribution, relative cache size, file size distribution, and queue size. In all these tests, the results show significant improvement over traditional caching algorithms such as GDS.
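The file-bundle requirement can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, nor its simulation model: it replays a short job sequence against a plain file-level LRU cache and counts how often a whole bundle is already cached. The names `can_service` and `simulate_lru`, the unit file sizes, and the example workload are all assumptions made for illustration.

```python
from collections import OrderedDict


def can_service(bundle, cache):
    """A job is serviceable only if every file in its bundle is cached."""
    return all(f in cache for f in bundle)


def simulate_lru(jobs, capacity):
    """Replay jobs against a file-level LRU cache (assumes unit-size files).

    Returns (bundle_hits, file_hits, file_requests).
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    bundle_hits = file_hits = file_requests = 0
    for bundle in jobs:
        # Bundle hit: all files are present before anything is loaded.
        if can_service(bundle, cache):
            bundle_hits += 1
        for f in bundle:
            file_requests += 1
            if f in cache:
                file_hits += 1
                cache.move_to_end(f)  # mark as most recently used
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[f] = True
    return bundle_hits, file_hits, file_requests


# Hypothetical workload: three overlapping two-file bundles, repeated twice.
jobs = [("a", "b"), ("b", "c"), ("c", "a")] * 2
print(simulate_lru(jobs, capacity=2))  # → (0, 5, 12)
```

In this toy workload, 5 of the 12 individual file requests hit the cache, yet no job ever finds its complete bundle there. A per-file hit ratio can therefore look reasonable while the per-bundle hit ratio, the quantity that determines whether a job can be serviced, is zero.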