Activating Big Data at Scale
- Author(s): Wang, Xikui
- Advisor(s): Carey, Michael J.
- et al.
With our world more digitized than ever, handling Big Data has become a fundamental challenge in building modern applications and services. Although both academia and industry have developed a plethora of systems in recent years to help developers work with Big Data, many of them still follow the pattern of passively responding to users' queries rather than actively processing and delivering data to interested users. We need systems for activating Big Data at scale and reducing users' effort in working with Big Active Data.
In this dissertation, we explore three problems related to activating Big Data at scale. We first investigate the problem of enabling data enrichment during data ingestion. We discuss the needs and challenges of enriching data during ingestion, and we introduce a new ingestion framework into AsterixDB, dynamic data feeds, that supports complex data enrichment functions and captures relevant data changes in the system during ingestion. We present the design and implementation of the new ingestion framework and evaluate its performance using different enrichment use cases.
Then, we look at the Big Active Data (BAD) challenge. We describe a BAD world that consists of different types of users and requests, and we propose a BAD system for providing BAD services to BAD users. We first review the initial prototype of the BAD system, BAD-RQ, and discuss its limitations in continuous BAD use cases. We then introduce a new BAD service, BAD-CQ, that provides continuous query semantics in the BAD system. Further, we use an alternative system, constructed by gluing together multiple existing Big Data systems, to show the challenges of providing BAD services without the BAD system. We measure the performance of BAD-CQ under various workloads and compare it with the alternative system's performance.
Last but not least, we study how to allow users to declaratively create scalable data sharing services between multiple BAD system instances without having to create and manage dedicated programs or services. We describe the notion of BAD islands that consist of multiple BAD instances and introduce new features to the BAD system for "bridging" multiple BAD instances together. We use a sample use case to illustrate how to create bridges between different BAD systems. Finally, we present a demonstration system, which also uses dynamic data feeds and BAD-CQ, to show how BAD islands work.