Data-intensive applications have attracted considerable attention from researchers in the information sciences and from enterprises, as these applications have enabled breakthroughs in scientific fields and are extremely valuable for improving productivity in businesses. Recently, driven by the rapid growth of newly generated data, researchers have begun to leverage the useful knowledge hidden in such huge volumes of data to optimize the performance of data-intensive applications. However, optimizing the performance of data-intensive applications through data-driven approaches remains largely unexplored.
In this thesis, we focus on data-driven performance optimization for data-intensive applications. We first study one such application: auto-labeling data on mobile
devices. How to accurately and efficiently label data on a mobile device is critical
for the success of training machine learning models on mobile devices. Auto-labeling
data for data-intensive applications on mobile devices is a challenging task, because data is generated incrementally and newly arriving data may contain unknown labels. Furthermore, the rich hardware heterogeneity of mobile devices makes it challenging to execute the auto-labeling workload efficiently. We
introduce Flame, an auto-labeling system that can label dynamically generated data
with unknown labels. Flame includes an execution engine that efficiently schedules
and executes auto-labeling workloads on heterogeneous mobile processors. Evaluating
Flame with six datasets on two mobile devices, we demonstrate that the labeling
accuracy of Flame is 11.8%, 16.1%, 18.5%, and 25.2% higher than that of a state-of-the-art labeling method, transfer learning, semi-supervised learning, and boosting, respectively. Flame is also energy efficient: it consumes only 328.65 mJ and 414.84 mJ when labeling 500 data instances on a Samsung S9 and a Google Pixel 2, respectively. Moreover, running Flame on a mobile device adds only about 0.75 ms of frame latency, which is imperceptible to users.
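To make unknown-label handling concrete, the Python sketch below assigns a known label only when a model's confidence clears a threshold and marks everything else as unknown; the threshold value, class names, and probabilities are illustrative assumptions, not Flame's actual labeling algorithm.

    # A minimal sketch of threshold-based auto-labeling with unknown-label
    # detection. The model outputs, threshold, and classes are hypothetical.
    import numpy as np

    def auto_label(probabilities: np.ndarray, known_classes: list[str],
                   confidence_threshold: float = 0.8) -> list[str]:
        """Assign a known label when the model is confident; otherwise
        flag the instance as 'unknown' so it can seed a new class later."""
        labels = []
        for p in probabilities:              # one softmax vector per instance
            best = int(np.argmax(p))
            if p[best] >= confidence_threshold:
                labels.append(known_classes[best])
            else:
                labels.append("unknown")     # candidate for an unseen class
        return labels

    # Example: three incoming instances, two known classes.
    probs = np.array([[0.95, 0.05],   # confidently the first class
                      [0.55, 0.45],   # ambiguous -> unknown
                      [0.10, 0.90]])  # confidently the second class
    print(auto_label(probs, ["walking", "running"]))
    # ['walking', 'unknown', 'running']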
Second, we explore another data-intensive application: cardinality estimation in database systems. Cardinality estimation is a fundamental and critical
problem in databases. Recently, many estimators based on deep learning have been
proposed to solve this problem, and they have achieved promising results. However, these estimators struggle to provide accurate results for complex queries because they do not capture real inter-column and inter-table correlations. Furthermore, none of these estimators provides uncertainty information about its estimates. In this work, we present a join cardinality estimator called Fauce. Fauce learns the correlations across all columns and all tables in the database, and it attaches uncertainty information to each estimate. Among all studied learned estimators, our results are promising: (1) Fauce has the smallest model size; (2) it has the fastest inference speed; (3) compared with the state-of-the-art estimator, Fauce offers 10× faster inference and 1.3×∼6.7× smaller estimation errors for complex queries; (4) to the best of our knowledge, Fauce is the first estimator that incorporates uncertainty information for cardinality estimation into a deep learning model.
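As a rough illustration of uncertainty-aware estimation, the Python sketch below uses a small deep ensemble: the ensemble mean serves as the cardinality estimate and the spread across members as its uncertainty. The featurization, network sizes, and ensemble size are illustrative assumptions, not Fauce's actual architecture.

    # A minimal sketch of uncertainty-aware cardinality estimation with a
    # deep ensemble. All dimensions and the query encoding are hypothetical.
    import torch
    import torch.nn as nn

    class CardinalityNet(nn.Module):
        """Small regressor mapping an encoded query to log-cardinality."""
        def __init__(self, num_features: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x).squeeze(-1)   # predicted log(cardinality)

    def estimate_with_uncertainty(ensemble, query_features: torch.Tensor):
        """Mean of ensemble predictions is the estimate; the spread across
        ensemble members serves as the uncertainty of that estimate."""
        with torch.no_grad():
            preds = torch.stack([m(query_features) for m in ensemble])
        return preds.mean(dim=0), preds.std(dim=0)

    # Example: a 5-member ensemble over a 32-dimensional query encoding.
    ensemble = [CardinalityNet(num_features=32) for _ in range(5)]
    features = torch.randn(4, 32)             # 4 encoded queries
    log_card, uncertainty = estimate_with_uncertainty(ensemble, features)
    print(log_card.shape, uncertainty.shape)  # torch.Size([4]) torch.Size([4])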
Third, we study the data loading problem for large-scale distributed
training. The resource-hungry and time-consuming process of training Deep
Neural Networks (DNNs) can be accelerated by optimizing and/or scaling computations
on accelerators such as GPUs. However, the loading and pre-processing of
training samples then often emerges as a new bottleneck. This data loading process
engages a complex pipeline that extends from the sampling of training data on
external storage to delivery of those data to GPUs, and that comprises not only
expensive I/O operations but also decoding, shuffling, batching, augmentation, and
other operations. We propose a new, holistic approach to data loading
that addresses three challenges not sufficiently addressed by other methods: I/O load imbalances among the GPUs on a node; rigid resource allocations to the data loading and data preprocessing steps, which lead to idle resources and bottlenecks; and the limited efficiency of prefetching-based caching strategies, which evict training samples needed soon in favor of those needed later. We first present a study of the key bottlenecks observed as training samples flow through the data loading and preprocessing
pipeline. Then, we describe Lobster, a data loading runtime that uses performance
modeling and advanced heuristics to combine flexible thread management with optimized
eviction for distributed caching in order to mitigate I/O overheads and load
imbalances. Experiments with a range of models and datasets show that the Lobster
approach reduces both I/O overheads and end-to-end training times by up to 1.5×
compared with state-of-the-art approaches.
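As a concrete illustration of next-use-aware eviction, the Python sketch below simulates a Belady-style cache that evicts the sample whose next access lies furthest in the future; the cache size, access schedule, and function name are illustrative assumptions, not Lobster's implementation.

    # A minimal sketch of next-use-aware eviction for a training-data cache:
    # evict the cached sample whose next access is furthest in the future.
    from collections import defaultdict, deque

    def next_use_eviction(access_order: list[int], cache_size: int) -> int:
        """Simulate a Belady-style cache over a known epoch access order;
        return the number of cache misses (reads from slow storage)."""
        # Pre-compute, for each sample, the positions where it is used.
        uses = defaultdict(deque)
        for pos, sample in enumerate(access_order):
            uses[sample].append(pos)

        cache, misses = set(), 0
        for pos, sample in enumerate(access_order):
            uses[sample].popleft()                 # consume this access
            if sample not in cache:
                misses += 1
                if len(cache) >= cache_size:
                    # Victim: cached sample needed furthest in the future.
                    victim = max(
                        cache,
                        key=lambda s: uses[s][0] if uses[s] else float("inf"))
                    cache.remove(victim)
                cache.add(sample)
        return misses

    # Example: a 2-slot cache over a short shuffled access sequence.
    print(next_use_eviction([1, 2, 3, 1, 2, 1], cache_size=2))  # 4 misses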
Finally, we study cardinality estimation for string predicates, a notoriously challenging problem in database systems. We present ArbiLIKE, an advanced deep
learning-based cardinality estimator for arbitrary LIKE predicates. ArbiLIKE utilizes
a cardinality-aware embedding technique to encode LIKE predicates into feature
vectors. It further incorporates an innovative sequence model to capture the semantic
information of different substrings, enhancing the estimation accuracy. ArbiLIKE is
also capable of handling LIKE predicates with any combination of the wildcards “%” and “_”. Empirical evaluations showcase ArbiLIKE’s promising accuracy, achieving estimation
errors that are up to 165.1× smaller than those of eight baselines, including
state-of-the-art methods. As a generic estimator, ArbiLIKE realizes error reductions
ranging from 1.4× to 93.1× for LIKE predicates with multiple wildcards in comparison
to the existing techniques. To the best of our knowledge, ArbiLIKE is the first deep
learning-based estimator capable of handling arbitrary LIKE predicates.
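To illustrate the general shape of a sequence-model estimator for LIKE patterns, the sketch below tokenizes a pattern (treating “%” and “_” as wildcard tokens) and regresses log-cardinality with a GRU; the tokenization granularity, vocabulary, and model are illustrative assumptions rather than ArbiLIKE's cardinality-aware embedding.

    # A minimal sketch of encoding a LIKE pattern for a sequence model.
    # The character-level tokenization and GRU regressor are hypothetical.
    import torch
    import torch.nn as nn

    def tokenize_like(pattern: str) -> list[str]:
        """Split a LIKE pattern into literal characters and the wildcards
        '%' (any substring) and '_' (any single character)."""
        return list(pattern)   # e.g. "ab%c_" -> ['a', 'b', '%', 'c', '_']

    class LikeCardinalityModel(nn.Module):
        """Embed pattern tokens, summarize them with a GRU, and regress
        log-cardinality from the final hidden state."""
        def __init__(self, vocab_size: int, embed_dim: int = 32,
                     hidden: int = 64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            emb = self.embed(token_ids)            # (batch, seq, embed_dim)
            _, h_n = self.gru(emb)                 # h_n: (1, batch, hidden)
            return self.head(h_n[-1]).squeeze(-1)  # predicted log(cardinality)

    # Example: score one pattern with a toy character vocabulary.
    vocab = {c: i for i, c in enumerate("%_abcdefghijklmnopqrstuvwxyz")}
    tokens = torch.tensor([[vocab[t] for t in tokenize_like("ab%c_")]])
    model = LikeCardinalityModel(vocab_size=len(vocab))
    print(model(tokens).shape)  # torch.Size([1])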