I/O Optimization in Big Data Storage Systems
The age of big data has evolved into the era of the Internet of Things (IoT), in which massive-scale data is generated, stored, and used by a diverse set of physical objects: devices, vehicles, buildings, software, sensors, GPS and networks. It has become an open challenge for researchers in academia and industry to find the best ways to ingest, replicate, manage, read and deliver this rapidly growing data efficiently to millions of users in real time. Big data storage systems, especially NoSQL databases such as LevelDB, Cassandra, BigTable and AsterixDB, have become extremely popular in the last decade for managing large amounts of data that do not require stringent concurrency or transaction management guarantees.
In such settings, NoSQL systems achieve high write throughput. My research interests focus on Input/Output (I/O) optimizations of such state-of-the-art big data storage systems. Specifically, my thesis targets three aspects of optimization: indexing, partitioning and replication.
a) Indexing of non-key attributes: Current state-of-the-art big data storage systems have limited support for secondary attribute lookup queries and continuous lookup queries. To tackle these limitations, we first introduce and implement five secondary indexes on a NoSQL database. Specifically, we use the popular LevelDB database, which employs a Log-Structured Merge-Tree (LSM) to organize its data.
Our comprehensive experimental study and theoretical evaluation provide empirical guidelines for choosing the most suitable secondary index, depending on the workload of the application.
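To make the idea of a secondary attribute lookup concrete, the following is a minimal sketch of one generic approach: a separate index table that maps an attribute value to the primary keys holding it. It is not claimed to be one of the five index designs studied in the thesis. Plain Python dictionaries stand in for the LevelDB tables, and the class name and the "user" attribute are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class KVStoreWithSecondaryIndex:
    """Toy key-value store with one stand-alone secondary index.

    Assumption: plain dicts stand in for the primary table and the index
    table. A real LSM store would keep both as sorted on-disk components.
    """

    def __init__(self, attribute):
        self.attribute = attribute      # hypothetical indexed non-key field
        self.primary = {}               # primary key -> record (a dict)
        self.index = defaultdict(set)   # attribute value -> set of primary keys

    def put(self, key, record):
        old = self.primary.get(key)
        if old is not None and self.attribute in old:
            # Drop the stale index entry so lookups never return outdated keys.
            self.index[old[self.attribute]].discard(key)
        self.primary[key] = record
        if self.attribute in record:
            self.index[record[self.attribute]].add(key)

    def lookup(self, value):
        """Return (key, record) pairs whose indexed attribute equals value."""
        keys = sorted(self.index.get(value, ()))
        return [(k, self.primary[k]) for k in keys]


store = KVStoreWithSecondaryIndex("user")
store.put("t1", {"user": "alice", "text": "first post"})
store.put("t2", {"user": "bob", "text": "hello"})
store.put("t3", {"user": "alice", "text": "second post"})
alice_keys = [k for k, _ in store.lookup("alice")]
```

Every write in this sketch touches both tables, which illustrates the kind of write-cost versus lookup-cost trade-off that makes the choice of index depend on the workload.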
b) Indexing for publish-subscribe systems: We propose and compare several publish/subscribe storage architectures, based on the popular NoSQL LSM storage paradigm, to support high-throughput and highly dynamic continuous lookup queries. Our framework naturally supports subscriptions on both historic and future streaming data, and generates instant notifications.
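As a rough illustration of a subscription that covers both historic and future data, the sketch below replays matching stored records when a subscriber registers and then notifies it on every later matching write. This is an assumption-laden toy, not the storage architecture proposed in the thesis: an in-memory list stands in for the LSM-organized store, and the names PubSubStore, subscribe and publish are hypothetical.

```python
class PubSubStore:
    """Toy store where subscribers receive past and future matching records."""

    def __init__(self):
        self.log = []            # append-only list of (seq, value, payload)
        self.subscribers = {}    # attribute value -> list of callbacks

    def subscribe(self, value, callback):
        # Historic data: replay every stored record that already matches.
        for seq, stored_value, payload in self.log:
            if stored_value == value:
                callback(seq, payload)
        # Future data: register the callback for later writes.
        self.subscribers.setdefault(value, []).append(callback)

    def publish(self, value, payload):
        seq = len(self.log)
        self.log.append((seq, value, payload))
        # Notify subscribers of this value as soon as the write is stored.
        for callback in self.subscribers.get(value, []):
            callback(seq, payload)
        return seq


pubsub = PubSubStore()
seen = []
pubsub.publish("nyc", "a")
pubsub.publish("la", "b")
pubsub.subscribe("nyc", lambda seq, payload: seen.append((seq, payload)))
pubsub.publish("nyc", "c")
```

The linear scan over the log on subscription is what an index over the stored data would avoid; the sketch keeps it only for brevity.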
c) Data partitioning: We create optimization techniques for spatial indexes via intelligent partitioning. Currently, NoSQL-based databases do not offer any spatial partitioning to speed up spatial query response. We propose a level-based organization of disk components and two novel component merge techniques that leverage the spatial properties of those components.
d) Data replication: Other important features of big data storage systems are availability and reliability, which are achieved through replication. Paxos is a widely used consensus protocol for keeping replicas in sync. We develop an I/O-optimized, Paxos-based, fault-tolerant block storage replication engine.