Search

Scholarly Works (22 results)

Sort By:

Show:

Thesis
Peer Reviewed

GLADE-ML: A Database For Big Data Analytics

Qin, Chengjie
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2016)

Big Data Analytics has been a hot topic in computing systems and varies systems have emerged to better support Big Data Analytics. Though databases have been the data hub for decades, they fall short of Big Data Analytics due to inherent limitations. This dissertation present GLADE-ML, a scalable and efficient parallel database that is specifically tailored for Big Data Analytics. Different from traditional databases, GLADE-ML provides iteration management and explicit or implicit randomization in the execution strategy. GLADE-ML provides in-database analytics which outperforms other in-database analytics solutions by several orders of magnitude.

GLADE-ML also introduces dot-product join operator in GLADE-ML. Dot-product join operator is specifically designed for Big Models. Big Data analytics has been approached exclusively from a data-parallel perspective, where data are partitioned to multiple workers – threads or separate servers – and model training is executed concurrently over different partitions, under various synchronization schemes that guarantee speedup and/or convergence. The dual -- Big Model -- problem that, surprisingly, has received no attention in database analytics, is how to manage models with millions if not billions of parameters that do not fit in memory. This distinction in model representation changes fundamentally how in-database analytics tasks are carried out.

GLADE-ML supports model parallelism over massive models that cannot fit in memory. GLADE-ML extends the lock-free HOGWILD!-family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. Unlike HOGWILD!, concurrent requests to the common model are minimized by a preemptive push-based sharing mechanism that reduces both the number of disk accesses as well as the cache coherency messages between workers. Extensive experimental results for three widespread analytics tasks on real and synthetic datasets show that the proposed framework achieves similar convergence to HOGWILD!, while being the only scalable solution to disk-resident models.

Another distinct feature of GLADE-ML is Hyper-Parameter Tuning. Identifying the optimal hyper-parameters is a time-consuming process that the computation has to be executed from scratch for every dataset/model combination even by experienced data scientists. GLADE-ML provides speculative parameter testing which applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration.

Cover page: GLADE-ML: A Database For Big Data Analytics

Article
Peer Reviewed

Speculative Approximations for Terascale Analytics

UC Merced Previously Published Works (2014)

Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a $1/20^{\text{th}}$ fraction of the time.

Cover page: Speculative Approximations for Terascale Analytics

Thesis
Peer Reviewed

Query Optimization using Sketches in Relational Database Systems

Izenov, Yesdaulet
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2023)

Query optimization remains a crucial element of relational database systems. With rapidly expanding data volumes and an increasing trend of machine-generated queries, the significance of query optimization is only increasing and requires continuous advancements. The objective of the query optimizer is to identify an optimal query execution plan from a vast number of semantically equivalent query plans. The success of this search process depends on the optimal operation of the internal inter-connected components of the query optimizer.

In this dissertation, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics - Fast-AGMS sketches. While maintaining high accuracy, the highly parallelizable nature of Fast-AGMS empowers the query optimizer to accommodate more complex queries. Subsequently, we redefine the objective of the query optimizer to find a spanning tree with a low cost. Capitalizing on the polynomial time complexity of spanning tree algorithms, we present ESTE, an ensemble spanning tree-based enumeration strategy. ESTE systematically enumerates different parts of the search space, thereby enhancing the robustness of the query optimizer. We believe this perspective enables the application of well-studied spanning-tree algorithms to the field of query optimization. Finally, we address the impact of cardinality estimation errors on query optimizers. Given their inevitability, these errors can cause a domino effect, leading to additional mistakes in the subsequent components of the optimizer. We propose L1-error, a new indicator designed to identify sub-optimal plans. L1-error accounts for the fact that certain estimation errors may have more impact on selecting an optimal plan than others.

Cover page: Query Optimization using Sketches in Relational Database Systems

Creative Commons 'BY-NC-ND' version 4.0 license

Thesis
Peer Reviewed

Gradient Descent Optimization on Heterogeneous Architectures

Ma, Yujing
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2021)

There is an increased interest in building machine learning frameworks with advanced algebraic capabilities both in industry and academia. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors --- as multi-thread routines for multi-core CPUs and as highly-parallel kernels executed on GPUs. The widely-adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. This procedure does not employ effectively the extensive CPU and memory resources -- which are used only to schedule computation and transfer data -- available on the servers. Moreover, for multi-GPU systems, the clock/memory speed may vary a lot even for the GPUs with the same model from the same vendor. The heterogeneity of GPUs has to be carefully considered. In addition, the optimization algorithm for the deep learning framework also plays a pivotal role in the training performance. Gradient descent (GD) is the most popular optimization method for model training on modern machine learning platforms. However, its convergence and its adaptation to heterogeneous systems is still an open research direction. In this dissertation, we perform a comprehensive experimental study of parallel GD for training machine learning models. We consider the impact of three factors -- computing architecture, synchronous or asynchronous model updates, and data sparsity -- on three measures --- hardware efficiency, statistical efficiency, and time to convergence. We introduce a generic heterogeneous CPU+GPU framework that exploits the difference in computational power and memory hierarchy between CPU and GPU through asynchronous message passing to maximize performance and resource utilization. Based on insights gained through experimentation with the framework, we design two heterogeneous asynchronous GD algorithms --- CPU+GPU HogBatch and Adaptive HogBatch. We also build a heterogeneity-aware multi-GPU framework to reduce the synchronization cost and data transfer cost of the training. We present the novel synchronous GD algorithm to tackle the heterogeneity challenge in multi-GPU systems --- Adaptive GD. We successfully show that the implementations of these algorithms in the proposed frameworks greatly accelerate the convergence and significantly achieve higher resource utilization than state-of-the-art machine learning systems on several real datasets.

Cover page: Gradient Descent Optimization on Heterogeneous Architectures

Article
Peer Reviewed

Formal representation of the SS-DB benchmark and experimental evaluation in EXTASCID

UC Merced Previously Published Works (2015)

Evaluating the performance of scientific data processing systems is a difficult task considering the plethora of application-specific solutions available in this landscape and the lack of a generally-accepted benchmark. The dual structure of scientific data coupled with the complex nature of processing complicate the evaluation procedure further. SS-DB is the first attempt to define a general benchmark for complex scientific processing over raw and derived data. It fails to draw sufficient attention though because of the ambiguous plain language specification and the extraordinary SciDB results. In this paper, we remedy the shortcomings of the original SS-DB specification by providing a formal representation in terms of ArrayQL algebra operators and ArrayQL/SciQL constructs. These are the first formal representations of the SS-DB benchmark. Starting from the formal representation, we give a reference implementation and present benchmark results in EXTASCID, a novel system for scientific data processing. EXTASCID is complete in providing native support both for array and relational data and extensible in executing any user code inside the system by the means of a configurable metaoperator. These features result in significant improvement over SciDB at data loading, extracting derived data, and operations over derived data. © 2014 Springer Science+Business Media New York.

Cover page: Formal representation of the SS-DB benchmark and experimental evaluation in EXTASCID

Article
Peer Reviewed

Dot-Product Join: An Array-Relation Join Operator for Big Model Analytics

UC Merced Previously Published Works (2016)

Big Model analytics tackles the training of massive models that go beyond the available memory of a single computing device, e.g., CPU or GPU. It generalizes Big Data analytics which is targeted at how to train memory-resident models over out-of-memory training data. In this paper, we propose an in-database solution for Big Model analytics. We identify dot-product as the primary operation for training generalized linear models and introduce the first array-relation dot-product join database operator between a set of sparse arrays and a dense relation. This is a constrained formulation of the extensively studied sparse matrix vector multiplication (SpMV) kernel. The paramount challenge in designing the dot-product join operator is how to optimally schedule access to the dense relation based on the non-contiguous entries in the sparse arrays. We prove that this problem is NP-hard and propose a practical solution characterized by two technical contributions---dynamic batch processing and array reordering. We devise three heuristics -- LSH, Radix, and K-center -- for array reordering and analyze them thoroughly. We execute extensive experiments over synthetic and real data that confirm the minimal overhead the operator incurs when sufficient memory is available and the graceful degradation it suffers as memory becomes scarce. Moreover, dot-product join achieves an order of magnitude reduction in execution time over alternative in-database solutions.

Cover page: Dot-Product Join: An Array-Relation Join Operator for Big Model Analytics

Thesis
Peer Reviewed

An Experimental Study of Distributed Quantile Estimation

Zhuang, Zixuan
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2015)

Quantiles are very important statistics information used to describe the distribution of datasets. Given the quantiles of a dataset, we can easily know the distribution of the dataset, which is a fundamental problem in data analysis. However, quite often, computing quantiles directly is inappropriate due to the memory limitations. Further, in many settings such as data streaming and sensor network model, even the data size is unpredictable. Although the quantiles computation has been widely studied, it was mostly in the sequential setting. In this paper, we study several quantile computation algorithms in the \emph{distributed} setting and compare them in terms of space usage, running time, and accuracy. Moreover, we provide detailed experimental comparisons between several popular algorithms. Our work focuses on the approximate quantile algorithms which provide error bounds. Approximate quantiles have received more attentions than exact ones since they are often faster, can be more easily adapted to the distributed setting while giving sufficiently good statistical information on the data sets.

Cover page: An Experimental Study of Distributed Quantile Estimation

Thesis
Peer Reviewed

Multi Query Optimization in GLADE

Rafay, Abdur
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2016)

SQL-on-Hadoop systems, query optimization, data distribution over multiple nodes and parallelization techniques are few of the areas under extreme research these days. Big names like Amazon, Google, Microsoft and many more are working on implementing systems for faster access of data from multiple nodes, reducing data mobility and increasing the parallelization. Customer’s queries are retrieved and reviewed by the database systems in an efficient way in the least amount of time by the introduction of various parallelization techniques by running the same query in parallel over different nodes carrying the data. Apart from multi-threading parallelization, there is another way of parallelization that can be performed in order to further reduce retrieval time, hence improving the efficiency of the system; parallelization on user queries on top of a DBMS/RDBMS. In this paper, we will study one such technique of how multiple queries can run simultaneously on a system in order to increase the system efficiency by reducing the data retrieval from the storage. Maximum sharing of workload has been performed by generating optimal and ubiquitous join plans for a set of queries and then fed them to GLADE (Generalized Linear Aggregate Distributed Engine), a scalable distributed system for large scale data analytics. Our main work is centered on generating GLADE join plans for a Multi-Query, satisfying maximum number of queries in order to maximize data sharing and minimize data retrieval for each individual query.

Cover page: Multi Query Optimization in GLADE

Thesis
Peer Reviewed

High Throughput Push Based Storage Manager

Zhu, Ye
Advisor(s): Rusu, Florin

UC Merced Electronic Theses and Dissertations (2019)

The storage manager, as a key component of the database system, is responsible for organizing, reading, and delivering data to the execution engine for processing. According to the data serving mechanism, existing storage managers are either pull-based, incurring high latency, or push-based, leading to a high number of I/O requests when the CPU is busy. To improve these shortcomings, this thesis proposes a push-based prefetching strategy in a column-wise storage manager. The proposed strategy implements an efficient cache layer to store shared data among queries to reduce the number of I/O requests. The capacity of the cache is maintained by a time access-aware eviction mechanism. Our strategy enables the storage manager to coordinate multiple queries by merging their requests and dynamically generate an optimal read order that maximizes the overall I/O throughput. We evaluated our storage manager both over a disk-based redundant array of independent disks (RAID) and an NVM Express (NVMe) solid-state drive (SSD). With the high read performance of the SSD, we successfully minimized the total read time and number of I/O accesses.

Cover page: High Throughput Push Based Storage Manager

Article
Peer Reviewed

PF-OLA: a high-performance framework for parallel online aggregation

UC Merced Previously Published Works (2014)

Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead. © 2013 Springer Science+Business Media New York.

Cover page: PF-OLA: a high-performance framework for parallel online aggregation