The increasing prevalence of large graph data has produced a variety of research and applications tailored toward graph data management. Users aiming to perform graph analytics will typically start by importing existing data into a separate, dedicated graph storage engine. The cost of maintaining a separate system (e.g., the data copy and the associated queries) just for graph analytics may be prohibitive for users with Big Data. Furthermore, using separate systems for mixed-model analytics (e.g., JSON and graph) requires specialized solutions. In this thesis, we introduce Graphix and show how it enables property graph views of existing document data in AsterixDB, a Big Data management system boasting a partitioned-parallel query execution engine.
This thesis starts with a description of how Graphix property graphs naturally extend the AsterixDB document model to define vertices and edges. We detail how users can specify Graphix graphs in a manner that handles a wide variety of document-to-graph mappings while maintaining the schema flexibility offered by AsterixDB. Next, we explain how users can query their Graphix graphs. The Graphix query language (gSQL++) minimally extends AsterixDB's query language (SQL++) to express synergistic graph and traditional (multi-model) analytics. After describing the user-model aspects of Graphix, we detail how AsterixDB was extended to accommodate a recursive graph query construct: path finding. We focus on how the AsterixDB runtime layer was extended to realize semi-synchronous, partitioned-parallel recursion. We then discuss how to extend the query optimization layer, as well as the query parsing and AST rewriting layer, to reuse as much of AsterixDB as possible. This thesis concludes with an evaluation of Graphix against a native graph database, Neo4j. We show that Graphix is able to scale horizontally to perform on par with (and in some cases, even outperform) Neo4j for many kinds of operational and analytical queries, ultimately illustrating that users might not need a separate graph database just to issue graph queries.