UC Santa Cruz
Management of High-Volume Real-Time Streaming Data in Transient Environments
- Author(s): Bigelow, David
- Advisor(s): Brandt, Scott, et al.
In an information-driven world, the ability to capture and store data in real time is of the utmost importance. The scope and intent of such data capture, however, vary widely. Individuals record television programs for later viewing, governments maintain vast sensor networks to warn against calamity, scientists conduct experiments requiring immense data collection, and automated monitoring tools supervise a host of processes that human hands rarely touch. All such tasks have the same basic requirements (guaranteed capture, management, storage, and analysis of streaming real-time data) but with greatly differing parameters. Our ability to process and interpret data has grown faster than our ability to store and manage it, an imbalance that now hinders our ability to exploit that data.
The work presented in this dissertation demonstrates a means of integrating data management with the physical storage layer to gain performance not otherwise achievable. By fusing a close understanding of the disk hardware with the necessary components of a high-performance storage system, a unique method of data handling is constructed. This approach allows for hard performance guarantees and quality-of-service regulation at near-maximum hardware capabilities in a transient data environment for indefinite periods of time. The core storage system is made to understand the data as more than just a stream of bytes, building indexing and query capabilities into its basic operations. All such gains are fully compatible with the accoutrements of a large-scale storage system, such as reliability and control mechanisms, resulting in a fully viable new storage architecture.