Scholarly Works (48 results)

Thesis
Peer Reviewed

Improving SQL Performance Using Middleware-Based Query Rewriting

Bai, Qiushi
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2023)

Query performance is critical in database-supported applications where users need answers quickly to make timely decisions. For decades, databases have relied heavily on query rewriting to optimize SQL query performance. However, with the current prevalent use of business intelligence and interactive visualization systems, relying purely on the rewriting capabilities inside databases is insufficient to optimize queries generated by those modern applications. On the one hand, different applications have different performance requirements. Some applications prioritize responsiveness, requiring queries to be executed within strict time constraints, and others may prioritize accuracy. Traditional database-centric approaches fail to exploit such information and adapt to diverse application requirements. On the other hand, developers and domain experts possess valuable insights into the data and query patterns specific to their applications. However, query optimization techniques customized using domain knowledge can be infeasible to implement inside databases.

In this thesis, we focus on providing middleware-based query-rewriting techniques to help databases seize opportunities to optimize queries. First, for many applications with stringent response time constraints, such as interactive visualization systems, we propose a machine learning-powered query rewriting framework (called Maliva) to rewrite the queries with various options and help the databases generate efficient plans. Maliva leverages expensive, high-accuracy query cost estimators to guide its rewriting process. By considering a pre-defined time constraint, Maliva judiciously explores different rewriting options and balances the query planning time and the execution time to find an efficient rewritten query that meets the time constraint.

Second, in many cases, developers want to use their domain knowledge about the applications and datasets to rewrite queries for better performance. We propose a human-centered query rewriting solution (called QueryBooster) to provide users with an expressive and easy-to-use rule language to define rewriting rules. In addition, QueryBooster allows users to express their rewriting intentions by providing example query pairs. QueryBooster then automatically generalizes them into rewriting rules and suggests high-quality ones to the users.

Finally, to lower the barrier to adopting the proposed rewriting framework, we implemented QueryBooster as a middleware-based multi-user system that provides query rewriting between applications and databases as a service. Treating both the applications and databases as black boxes, QueryBooster requires no code modifications to them. To use the service, users only need to replace the database connector between the application and the database with a customized version provided by QueryBooster. The customized connector automatically intercepts application queries and sends them to QueryBooster, which rewrites them based on user-defined rewriting rules. QueryBooster’s service model brings SQL query rewriting to a new paradigm where (1) users can easily formulate, control, and monitor query rewriting; (2) they can share rewriting knowledge and benefit from the wisdom of the crowd; and (3) they enjoy the non-intrusiveness security and pay-as-you-go convenience.
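As a rough illustration of the middleware idea in this abstract, the sketch below takes an intercepted SQL string and applies user-defined rewriting rules before the query would be forwarded to the database. The `RewriteRule` class, the regex-based rule format, the `tweets` table, and the example rule are all hypothetical; they are not QueryBooster's actual rule language or API.

```python
import re
from dataclasses import dataclass


@dataclass
class RewriteRule:
    """A hypothetical rewriting rule: a regex pattern plus its replacement."""
    name: str
    pattern: str
    replacement: str

    def apply(self, sql: str) -> str:
        # Replace every match of the pattern, ignoring keyword case.
        return re.sub(self.pattern, self.replacement, sql, flags=re.IGNORECASE)


def rewrite(sql: str, rules: list) -> str:
    """Apply each rule in order. A real system would also have to make sure
    that every rule preserves the query's semantics."""
    for rule in rules:
        sql = rule.apply(sql)
    return sql


# Example rule (assumption): the column is already textual, so the cast is
# redundant and may prevent the database from using an index on it.
REMOVE_TEXT_CAST = RewriteRule(
    name="remove_text_cast",
    pattern=r"CAST\((\w+) AS TEXT\)",
    replacement=r"\1",
)

original = "SELECT * FROM tweets WHERE CAST(content AS TEXT) LIKE '%covid%'"
rewritten = rewrite(original, [REMOVE_TEXT_CAST])
print(rewritten)
```

In this sketch the application and the database stay untouched; only the string passing between them changes, which mirrors the black-box treatment the abstract describes.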

Creative Commons 'BY-NC-SA' version 4.0 license

Thesis
Peer Reviewed

Towards Interactive, Adaptive and Result-aware Big Data Analytics

Kumar, Avinash
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2022)

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective pursued by these traditional cluster-based big data processing frameworks is high performance, which often means low end-to-end execution time or latency.

A typical user of these frameworks submits a job to the framework and waits for the results for minutes, hours or even days based on the size of input data and complexity of the job. There is often a need to interact with an executing job to check its states or modify parts of the job. Traditional big data processing frameworks offer little insight into an executing job. They provide simple statistics such as data size input into and processed by various operators of a job, which may not be enough information for the user.

The widespread adoption of data analytics has led to a call to improve the traditional ways of big data processing. There have been demands for making the analytics process more interactive and adaptive, especially for long running jobs. A typical data analytics workflow undergoes multiple iterations of refinement to become the final workflow that performs a task correctly. While performing these iterations, a data analyst is more interested in seeing the first few results quickly than the total execution time. If the results are undesirable, the analyst can terminate the workflow without waiting for it to execute completely. This underlines the importance of initial results in the iterative process of data wrangling and motivates a result-aware approach to big data analytics.

This dissertation is motivated by these calls for improvement in data processing and the experiences over the past few years while working on the Texera project, which is a collaborative data analytics service being developed at UC Irvine. Texera is a GUI-based service that allows the users to drag-and-drop operators to create workflows that can be executed on computing clusters. This dissertation mainly consists of three parts. The first part is about the design of the Amber engine that serves as the backend data processing framework for the Texera service. Amber supports interactivity and adaptivity during data analysis. A key feature of Amber is the existence of fast control messages that allow the interaction and adaptation to happen with sub-second latency. The second part is about an adaptive and result-aware skew-handling framework called Reshape. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators. The mitigation techniques in Reshape have also been analyzed from the perspective of their effects on the results shown to the user. Reshape is also capable of self-tuning its threshold parameter to lessen the technical burden on the users. The last part is about a result-aware workflow scheduling framework called Maestro. This part talks about how to schedule a workflow for execution on computing clusters and make result-aware decisions while doing so. This work improves the data analytics process by bringing interactivity, adaptivity and result-awareness into the process.

Creative Commons 'BY' version 4.0 license

Thesis
Peer Reviewed

Texera: A System for Collaborative and Interactive Data Analytics Using Workflows

Wang, Zuozhi
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2023)

In the world of data analytics, domain experts, such as public health scientists and medical researchers, play a crucial role as their domain knowledge can unlock valuable insights from data. However, they face several challenges in the current landscape of data analytics tools. They often lack the technical skills necessary to analyze large datasets, requiring collaboration with technical experts who may not have relevant domain knowledge. Moreover, when processing large volumes of data, the execution times can be lengthy, and non-technical users are left in the dark without feedback.

Over the past six years, our team has been developing Texera, a workflow-based data analytics system specifically designed to enable non-technical users to perform data analytics tasks with ease by promoting seamless collaboration and responsive interactions. Texera enables multiple users to collaboratively construct workflows, offering an experience similar to that of Google Docs and Overleaf. Furthermore, Texera allows users to interact with the workflow execution, enabling them to pause/resume workflows, inspect execution states, and modify logic as needed.

In this thesis, we first present an overview of the Texera system in Chapter 2, discussing the design choices and the associated tradeoffs of several key components within Texera that enable these powerful features of real-time collaborations and user interactions. Following this, in Chapter 3, we explore a specific use case of user interaction: modifying the logic of operators in a workflow, also referred to as reconfigurations. We develop an algorithm called Fries, which can schedule these reconfigurations with minimal delay while maintaining transactional guarantees, particularly when a reconfiguration involves multiple operators. In Chapter 4, we shift our focus to incremental data processing, as Texera uses progressive computation to deliver early results to users. We present Tempura, a cost-based optimization framework designed for incremental processing. As a general framework, Tempura can support various incremental computation requirements for many different applications and use cases even beyond Texera's scope. Tempura can select the best incremental computation plan based on the specific query and data involved. In Chapter 5, we conclude this thesis and discuss future work.

Thesis
Peer Reviewed

Study of Phonon Engineering Using Neutron Scattering

Chen, Shuonan
Advisor(s): Li, Chen

UC Riverside Electronic Theses and Dissertations (2022)

The phonon, a quantized lattice vibration, plays an important role in the physical and mechanical properties of materials. This makes the investigation of phonon dynamics indispensable for better understanding the physical world. For decades, people have been seeking ways to manipulate phonon dynamics, which are closely related to the atomic structures and electromagnetic properties of materials, for the development of novel materials with desired properties and for more insights into the fundamental physics. This dissertation discusses studies of phonon engineering, using inelastic neutron scattering, in spatially confined silicon systems, which have been widely used in the biomedical field and have great potential applications in the optoelectronic industry, as well as in bulk sapphire systems.

The dissertation starts by introducing the basics of phonon dynamics and neutron scattering, which are the two primary subjects of my graduate research. In Chapters 2 to 4, the inelastic neutron scattering experiments and discussions on the effects of particle size, temperature, surface oxidization, and surface functionalization on the phonon dynamics of silicon nanocrystals are presented. These effects are found to be greater on the transverse acoustic phonon modes than on the optical phonon modes in silicon nanocrystals. In Chapter 5, the atomic structures of 3-nm spherical silicon nanocrystals were measured with elastic neutron scattering for the first time. The diffraction spectra show huge anisotropic structure variations inside the silicon nanocrystals compared to their bulk counterpart. In Chapter 6, the effect of low-concentration dopants on the phonon dynamics of sapphire was studied using inelastic neutron scattering.

This dissertation sheds light on the phonon dynamics of materials with great potential applications, as well as their dependence on intrinsic and extrinsic effects, and will contribute to further investigations on the phonon engineering of various materials.

Thesis
Peer Reviewed

Supporting Interactive Analytics and Visualization on Large Data

Jia, Jianfeng
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2017)

There is an increasing demand to visualize large datasets as human observable reports in order to quickly draw insights and gain timely awareness from the data. An interactive user interface is an indispensable tool that allows users to analyze the data from different perspectives and to inspect the result from the global overview to the finest granularity. To enable this type of interactive user experience, the front-end can generate new requests on the fly, and the results must be computed and delivered within seconds. Big Data platforms can take tens or hundreds of seconds to complete an OLAP-style query, so there is a need for a solution that can meet the stringent latency requirement of interactive visualization frontends.

In this thesis, we address the interactivity challenges from a middleware perspective to provide a generic solution that can utilize existing database systems as a "black box" to support various interactive visualization applications efficiently.

We present Cloudberry, an open-source general-purpose middleware system to support interactive analytics and visualization on big data with various attributes. It can automatically create, maintain, and delete materialized views by analyzing each request and its results. We build an application called "TwitterMap" using Cloudberry to demonstrate its suitability to support interactive analytics and visualization on more than one billion tweets (about 2TB).

We then present a query slicing technique in Cloudberry, called Drum, that can "slice" a query into small pieces (called "mini-queries") so that the middleware can send these mini-queries to the DBMS one by one and compute results progressively. Our experiments on a large, real dataset show that the Drum technique can reduce the delay of delivering intermediate results to the user without much reduction of the overall speed.
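The slicing idea can be sketched roughly as follows: a range query is cut into consecutive sub-ranges, each sub-range is sent to the database as its own mini-query, and the partial answers are accumulated so that an intermediate result is available after every step. The `tweets` table, the integer `day` column, and the fixed slice width are assumptions made for illustration; the abstract does not say how Drum chooses its slices.

```python
import sqlite3


def slice_ranges(lo: int, hi: int, width: int):
    """Yield half-open [start, end) sub-ranges that together cover [lo, hi)."""
    start = lo
    while start < hi:
        end = min(start + width, hi)
        yield start, end
        start = end


def progressive_count(conn, lo: int, hi: int, width: int):
    """Run one mini-query per slice and yield the running total after each one."""
    total = 0
    for start, end in slice_ranges(lo, hi, width):
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM tweets WHERE day >= ? AND day < ?",
            (start, end),
        ).fetchone()
        total += count
        yield (start, end), total


# Toy data set (assumption): 300 rows spread evenly over days 0..29.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, day INTEGER)")
conn.executemany("INSERT INTO tweets (day) VALUES (?)", [(d % 30,) for d in range(300)])

partials = list(progressive_count(conn, 0, 30, 10))
for (start, end), running_total in partials:
    print(f"days [{start}, {end}): running total = {running_total}")
```

A front end consuming this generator could refresh its display after every mini-query instead of waiting for the full range to finish, at the cost of issuing several queries rather than one.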

Finally, we present a method of using LSM filters to accelerate secondary-to-primary index search under the LSM storage setting. We have implemented it in Apache AsterixDB, and our experiments show that the new approach can reduce the query time by 20% to 70% for different queries.

Thesis
Peer Reviewed

Improving Iterative Analytics in GUI-Based Data-Processing Systems with Visualization, Version Control, and Result Reuse

Alsudais, Sadeem
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2023)

GUI-based data processing systems simplify and accelerate data tasks with a user-friendly interface, eliminating the need for extensive coding skills. This accessibility allows analysts to easily design, modify, and execute workflows with intuitive drag-and-drop operations and visual representations. Incorporating visualization operators into data processing systems to represent the processed result enables analysts to quickly gain insights, understand patterns, and make informed decisions from complex data. As analysts observe the results, they may uncover new trends, leading to further questions or hypotheses that require modifications and edits to the workflow. Each change to the workflow generates a new version. Given the iterative nature of data analytics, modifying workflows is a common practice. The results produced from executing these versions are materialized, enabling users to refer back to them to reproduce and replicate past experiments, ensuring the validity of reported outcomes. While striving for improved results, in many cases, the results of new iterations are equivalent to those of previous runs. Given the significant time required to execute analytical tasks on large datasets, it becomes imperative to reduce redundant computations by reusing previously-stored results. Hence, it is crucial to identify and verify the equivalence of results across different runs.

This dissertation is driven by these pressing needs to enhance iterative data analytics within GUI-based data processing systems by integrating visualization, version control, and result reuse. The dissertation is structured into four main parts.

The first part addresses the challenge of incrementally visualizing large spatial networks while minimizing visual clutter. To tackle this issue, we introduce GSViz, a general-purpose middleware-based solution consisting of two modules, namely edge-aware vertex clustering and incremental edge bundling to effectively visualize large spatial networks.

The second part focuses on the development of Drove, a framework designed to track changes in workflows, environment dependencies, workflow executions, and the generated results. By utilizing Drove, researchers and analysts can gain valuable insights into the evolution of workflows and understand the impact of modifications on the final outcomes.

In the third part, we present Veer, an algorithm for verifying the equivalence of two complex workflow versions. Additionally, we present a series of optimization techniques to improve the performance of the baseline algorithm.

Lastly, we introduce Raven, an optimization framework that ranks the previously executed workflow versions and then tests their equivalence against the workflow version in a new execution request. By reusing the results generated from these versions, Raven minimizes redundant computations and significantly enhances performance when handling new workflow execution requests. Raven retrieves the previous versions from Drove and delegates equivalence testing to Veer.

Thesis
Peer Reviewed

Utilization of Inelastic Scattering Techniques in Phonon Measurements

Cai, Qingan
Advisor(s): Li, Chen

UC Riverside Electronic Theses and Dissertations (2022)

The microscopic study of lattice vibrations is essential for regulating the thermal properties and understanding the phase transitions of materials. The newly proposed and observed chiral phonons are significant in controlling the entanglement of quantum dots and generating the thermal Hall effect in materials. In layered transition metal chalcogenides and some other quantum materials, lattice dynamics are mostly studied by first-principles calculations; phonon measurements are relatively rare, especially those with temperature and pressure dependence.

Phonon theory and experimental techniques, such as inelastic X-ray scattering, inelastic neutron scattering, and Raman scattering, for phonon measurement are briefly discussed. The phonon computational method is also reviewed. Phonon measurements and theoretical calculations were performed on some layered materials and other quantum materials.

Using millielectronvolt-resolution non-resonant inelastic X-ray scattering, we discovered that it could be utilized to directly probe phonon chirality throughout the whole Brillouin zone in tungsten carbide. The results show that phonon chirality and X-ray polarization play essential roles in the scattering process. The results also suggest that a revision to the textbook X-ray scattering function of phonons is needed.

To study the temperature and pressure dependence of lattice dynamics in materials, especially for layered transition metal chalcogenides, we performed the first temperature- and pressure-dependent inelastic X-ray scattering measurements on bulk tungsten diselenide and obtained the mode Grüneisen parameters. The results show monolayer-like lattice dynamics in the bulk tungsten diselenide. We also performed the pressure-dependent phonon measurement on palladium diselenide. A panoramic diamond anvil cell was used to generate the high hydrostatic pressure. We observed the pressure-dependent flexural phonons for the first time and quantified the elastic properties and interlayer van der Waals interactions in layered materials.

Using inelastic neutron scattering, we performed temperature- and pressure-dependent lattice dynamics measurements on p-terphenyl. The results indicate strong anharmonic phonon dynamics and suggest a lack of phase transition in the region of 0 to 1.51 kbar and 10 to 30 K.

Using Raman scattering, we performed pressure- and temperature-dependent measurements on Fe3GeTe2 and observed a significant pressure-induced phonon energy shift. The phonon energy shift may be related to the strong spin-phonon interactions, which may play important roles in its application for magnetic storage devices.

Thesis
Peer Reviewed

Using Random Forest to Classify Raman Spectra of Brain Tissues

Zhang, Weiyi
Advisor(s): Li, Chen

UC Riverside Electronic Theses and Dissertations (2022)

Traditional diagnosis of brain tumors is performed by neurologic exams and relies on specialists. The key difference between normal tissues and brain tumors can also be reflected in their Raman spectra, which provide a fingerprint for identifying different types of matter. In this thesis, we present a complete pipeline, from raw data pre-processing to model construction and evaluation, for identifying white matter, grey matter, and blood vessels. Mock spectra and several machine learning algorithms were used to choose the configuration of the pipeline. The results show that our approach discriminates these three types of spectra with high accuracy and good stability.

1 supplemental file

Thesis
Peer Reviewed

Transactional and Spatial Query Processing in the Big Data Era

Kim, Young-Seok
Advisor(s): Li, Chen

UC Irvine Electronic Theses and Dissertations (2016)

Over the past decade, the proliferation of mobile devices has generated a variety of data at an unprecedented rate. The trend will be further accelerated by the advent of the Internet-of-Things era. Such data include signals, texts, photos, and videos tagged with date, time, and geo coordinates. The data are structured, semi-structured, or unstructured. Data-processing systems that aim to ingest, store, index, and analyze Big Data must deal with such data efficiently. In response, we have developed Apache AsterixDB, a parallel, semi-structured information management platform that provides the ability to ingest, store, index, query, and analyze mass quantities of data.

The key contributions of this thesis fall into two major parts. First, in order to store and index newly generated data and make them queryable in a timely manner, a record-level transaction model was designed and implemented in AsterixDB based on the read-committed isolation level. Second, due to the importance of efficient query processing for such dynamic geo-tagged data, we implemented five variants of representative, disk-resident spatial indexing methods on top of the Log-Structured Merge-tree-based (LSM) storage layer in AsterixDB and evaluated their pros and cons in light of the dynamic characteristics of geo-tagged Big Data.

Article
Peer Reviewed

On containment of conjunctive queries with arithmetic comparisons

UC Irvine Previously Published Works (2004)

We study the following problem: how to test if Q2 is contained in Q1, where Q1 and Q2 are conjunctive queries with arithmetic comparisons? This problem is fundamental in a large variety of database applications. Existing algorithms first normalize the queries, then test a logical implication using multiple containment mappings from Q1 to Q2. We are interested in cases where the containment can be tested more efficiently. This work aims to (a) reduce the problem complexity from Π₂ᴾ-completeness to NP-completeness in these cases; (b) utilize the advantages of the homomorphism property (i.e., the containment test is based on a single containment mapping) in applications such as those of answering queries using views; and (c) observe that many real queries have the homomorphism property. The following are our results. (1) We show several cases where the normalization step is not needed, thus reducing the size of the queries and the number of containment mappings. (2) We develop an algorithm for checking various syntactic conditions on queries, under which the homomorphism property holds. (3) We further reduce the conditions of these classes using practical domain knowledge that is easily obtainable. (4) We conducted experiments on real queries, and show that most of the queries pass this test.
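To make the notion of a containment mapping concrete, the sketch below handles the simpler, classical case of conjunctive queries without arithmetic comparisons, where Q2 is contained in Q1 exactly when a single containment mapping from Q1 to Q2 exists. It is a brute-force illustration, not the algorithm of the article. The query encoding (uppercase strings as variables, lowercase strings as constants) and all function names are assumptions made for this example.

```python
from itertools import product

# A query is a pair (head, body): head is a tuple of terms, and body is a list
# of atoms of the form (predicate, args). Encoding assumption: strings starting
# with an uppercase letter are variables; everything else is a constant.


def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def extend(mapping, src_terms, dst_terms):
    """Extend mapping so src_terms map onto dst_terms position by position.
    Return the extended mapping, or None if that is impossible."""
    if len(src_terms) != len(dst_terms):
        return None
    new = dict(mapping)
    for s, d in zip(src_terms, dst_terms):
        if is_var(s):
            if new.setdefault(s, d) != d:  # variable already mapped elsewhere
                return None
        elif s != d:  # constants must map to themselves
            return None
    return new


def contained_in(q2, q1):
    """True if q2 is contained in q1, i.e. a containment mapping q1 -> q2 exists."""
    head1, body1 = q1
    head2, body2 = q2
    start = extend({}, head1, head2)
    if start is None:
        return False
    # Try every way of sending each atom of q1 to some atom of q2.
    for choice in product(body2, repeat=len(body1)):
        mapping = start
        for (p1, a1), (p2, a2) in zip(body1, choice):
            mapping = extend(mapping, a1, a2) if p1 == p2 else None
            if mapping is None:
                break
        if mapping is not None:
            return True
    return False


# Q1(X) :- r(X, Y)            -- X has an outgoing edge
# Q2(X) :- r(X, Y), r(Y, Z)   -- X starts a path of length two
q1 = (("X",), [("r", ("X", "Y"))])
q2 = (("X",), [("r", ("X", "Y")), ("r", ("Y", "Z"))])

print(contained_in(q2, q1))  # every answer of Q2 is an answer of Q1
print(contained_in(q1, q2))  # the converse does not hold
```

With arithmetic comparisons, a single mapping like this is sufficient only for queries that have the homomorphism property discussed in the abstract; in general, several mappings and an implication test are needed.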
