Modern high-end computing systems store hundreds of petabytes of data and hold billions of files, as many files as the internet held only a few years ago. Even modern personal computers store numbers of files that would have been massive for the largest mainframe computers of forty years ago. The quantities of data in modern computing have long since overwhelmed anyone's ability to manage them manually, and the forty-year-old tools currently in use for file finding and management are reaching the limits of scale. In an environment like this, secure, effective, and efficient search algorithms and automatic file management become a necessity, not a nicety.
This dissertation addresses the question of how users can better find and manage files by taking advantage of advances in file systems. We focus on a multi-user scientific computing environment, but many of the techniques we describe are effective and advantageous at desktop scale as well. We begin with an empirical description of the problem, drawn from user studies and our statistical analysis of scientific data, to better understand the problem domain. We then describe a new technique for studying provenance in scientific systems, and a technique to synthesize system-level provenance from existing traces. We present a novel algorithm that leverages provenance to provide importance ranking for file system search, and discuss the relationship between ranking and access prediction. Finally, we show how rich metadata can be used to improve file management by automatically generating expressive, unique file names.
Modern file management must be automatic and scalable, allowing users and file systems to focus on what each does best. By exploiting richer information such as provenance and semantic metadata, file systems can offer sophisticated new capabilities to ease the burdens of users, making file systems easier to use, navigate, and understand.