Supporting Interactive Analytics and Visualization on Large Data
- Author(s): Jia, Jianfeng
- Advisor(s): Li, Chen, et al.
There is an increasing demand to visualize large datasets as human-observable reports in order to quickly draw insights and gain timely awareness from the data. An interactive user interface is an indispensable tool that allows users to analyze the data from different perspectives and to inspect the results at any level, from the global overview down to the finest granularity. To enable this type of interactive user experience, the front-end can generate new requests on the fly, and the results must be computed and delivered within seconds. However, Big Data platforms can take tens or hundreds of seconds to complete an OLAP-style query, so there is a need for a solution that can meet the stringent latency requirement of interactive visualization front-ends.
In this thesis, we address the interactivity challenges from a middleware perspective to provide a generic solution that can utilize existing database systems as a "black box" to support various interactive visualization applications efficiently.
We present Cloudberry, an open-source general-purpose middleware system to support interactive analytics and visualization on big data with various attributes. It can automatically create, maintain, and delete materialized views by analyzing each request and its results. We build an application called "TwitterMap" using Cloudberry to demonstrate its suitability to support interactive analytics and visualization on more than one billion tweets (about 2TB).
We then present a query slicing technique in Cloudberry, called Drum, that can "slice" a query into small pieces (called "mini-queries") so that the middleware can send these mini-queries to the DBMS one by one and compute results progressively. Our experiments on a large, real dataset show that the Drum technique can reduce the delay in delivering intermediate results to the user without much reduction in overall speed.
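The idea of slicing a query into mini-queries and merging their results progressively can be illustrated with a minimal Python sketch. It is not Cloudberry's or Drum's actual implementation: the fixed slice width, the newest-first ordering, the `(timestamp, key)` record schema, and the in-memory stand-in for the DBMS are all assumptions made for illustration, and the abstract does not say how Drum chooses slice boundaries.

```python
from collections import Counter
from typing import Callable, Dict, Iterator, List, Tuple

Record = Tuple[int, str]  # (timestamp, group key): a hypothetical schema


def slice_range(start: int, end: int, width: int) -> Iterator[Tuple[int, int]]:
    """Split [start, end) into half-open intervals of at most `width`.

    Intervals are produced newest first (an assumption of this sketch).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    hi = end
    while hi > start:
        lo = max(start, hi - width)
        yield lo, hi
        hi = lo


def run_progressively(
    run_mini_query: Callable[[int, int], Counter],
    start: int,
    end: int,
    width: int,
) -> Iterator[Dict[str, int]]:
    """Send one mini-query per slice and yield the merged result so far."""
    total: Counter = Counter()
    for lo, hi in slice_range(start, end, width):
        total.update(run_mini_query(lo, hi))  # merge this slice's counts
        yield dict(total)  # intermediate result the front-end could render


def make_fake_dbms(records: List[Record]) -> Callable[[int, int], Counter]:
    """In-memory stand-in for the black-box DBMS (illustration only)."""
    def run_mini_query(lo: int, hi: int) -> Counter:
        # Equivalent to a grouped count over records with lo <= ts < hi.
        return Counter(key for ts, key in records if lo <= ts < hi)
    return run_mini_query


records = [(1, "CA"), (2, "NY"), (5, "CA"), (7, "TX"), (9, "CA")]
snapshots = list(run_progressively(make_fake_dbms(records), 0, 10, 4))
for snapshot in snapshots:
    print(snapshot)
```

Each yielded dictionary is a complete answer for the portion of the range processed so far, so a front-end could redraw after every mini-query instead of waiting for the full query. The trade-off the paragraph above refers to shows up here too: smaller slices give earlier intermediate results but add per-query overhead.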
Finally, we present a method of using LSM filters to accelerate secondary-to-primary index search under the LSM storage setting. We have implemented it in Apache AsterixDB, and our experiments show that the new approach can reduce the query time by 20% to 70% for different queries.