External Data Access and Indexing in a Scalable Big Data Management System
- Author(s): Alamoudi, Abdullah Abdulrahman
- Advisor(s): Carey, Michael J
- et al.
Traditional Database Management Systems (DBMSs) offer a long list of quality attributes such as high performance, a flexible query interface, accuracy, reliability, and fault tolerance. However, to get these benefits, users must first load their data into the system, where it is stored in the storage layer using the system's binary format and its own data structures. The space requirements and the computational and operational costs of loading data are at times unjustifiable. These costs grow as data becomes larger, especially when existing systems generate the data continually, e.g., by producing system logs. For these reasons, many existing applications don't use DBMSs at all and instead rely on custom scripts or specialized code that lack the qualities offered by DBMSs. Many data management systems have acknowledged this problem and let users apply the system's query language to analyze data in its raw format. However, external data access in most of these systems involves expensive full scan operations, which significantly degrades query performance. For this reason, many data management systems provide external data access as a way to facilitate data loading rather than for routine, ongoing querying and analysis.
Recently, several research projects have sought to improve the efficiency of external data access using different techniques. Each of these techniques has certain limitations, such as requiring changes to the existing external data, requiring the data to be written through a specialized system in the first place, or yielding only very small performance gains. In AsterixDB, the big data management system developed at UCI, we have designed and implemented a new feature that allows building incrementally refreshable indexes over external data. In this thesis, we explain in detail the different types of external data that AsterixDB can access through its adapters. We then explain the semantics and user model associated with the indexing of external data. We follow that with a discussion of the system design for indexing external data and show how the system addresses the different challenges associated with this task. We further provide an evaluation of query performance over external data that resides in the Hadoop Distributed File System (HDFS). We compare AsterixDB's external data access with Hive on the same data files and with queries over the same data after it has been loaded into AsterixDB's internal storage. We show that a user can get competitive performance using AsterixDB without having to first load their data into the system.