
Scholarly Works (18 results)


Thesis
Peer Reviewed

Highly Efficient String Similarity Search and Join over Compressed Indexes

Xiao, Guorui
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2023)

String similarity search and string similarity join are essential operations in many fields. Existing solutions adopt a filter-and-verification framework and build inverted indexes based on generated signatures to prune dissimilar candidates. While existing solutions mainly focus on improving query processing performance, little attention is paid to reducing the inverted indexes’ memory consumption. In cases where the index size is larger than the memory, users must employ more expensive disk-based algorithms rather than in-memory ones. In this thesis, we propose a flexible framework, CSS, to reduce the index size while keeping high query performance for string search and join applications. We give improved solutions for offline inverted list construction and introduce a new approach for the online construction of compressed inverted lists. Experimental results on large-scale datasets demonstrate that CSS can reduce memory consumption by up to 5 times while having similar, or even better, query processing performance.
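
The filter-and-verification framework mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the CSS framework and uses no compression; it assumes q-grams as signatures and Jaccard similarity over q-gram sets as the similarity function, and the names `qgrams`, `build_index`, and `search` are hypothetical. The filter step relies on the fact that, for a positive threshold, any qualifying string must share at least one q-gram with the query.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the set of q-grams of s (the signatures, by assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(strings, q=2):
    """Build an uncompressed inverted index: q-gram -> list of string ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def search(query, strings, index, threshold, q=2):
    """Return ids of strings whose q-gram Jaccard similarity to query
    is at least threshold (threshold must be > 0 for the filter to be safe)."""
    query_grams = qgrams(query, q)
    # Filter: only strings sharing at least one q-gram can have Jaccard > 0.
    candidates = set()
    for g in query_grams:
        candidates.update(index.get(g, ()))
    # Verification: compute the exact similarity for each surviving candidate.
    return sorted(
        sid for sid in candidates
        if jaccard(query_grams, qgrams(strings[sid], q)) >= threshold
    )


strings = ["database", "databases", "datalog", "stream"]
index = build_index(strings)
matches = search("database", strings, index, threshold=0.7)
```

In this toy run, "stream" is pruned by the filter without any similarity computation, while "datalog" passes the filter and is rejected during verification. The inverted lists here are plain Python lists, which is exactly the memory cost the thesis aims to reduce.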


Thesis
Peer Reviewed

Declarative Languages and Systems for Transparency, Performance and Scalability in Database Analytics

Li, Youfu
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2020)

Demand for powerful, high-performance analytics on Big Data is ever growing. Developing tools and methodologies for advanced Database analytics, such as Data Mining applications, has long been an active area of research that has posed elusive challenges to both academia and industry, on topics that include: 1) design of expressive high-level languages with declarative semantics for data analytics, 2) optimization and parallelization for efficient and scalable execution, and 3) transparency of analytics dataflow for error tracking and debugging. This thesis proposes methods and tools for developing powerful data analytics systems based on declarative languages, dataflow inspection and query optimization. By leveraging and integrating these tools we obtain i) a scalable data analytics framework for knowledge discovery by concise and declarative queries, ii) a unified solution that enables analytics dataflow inspection and further supports provenance and debugging for data analytic applications, and iii) an integrated runtime query optimizer that generates optimal execution plans for data analytics queries and achieves superior performance in application areas that had posed major challenges for traditional Database technology.

In particular, our KDDLog system enables users to build or customize knowledge discovery models in a concise and expressive language, via recursive queries with aggregates and our newly proposed chain aggregates. We further provide specialized compilation techniques for semi-naive fixpoint computation in the presence of aggregates, optimizations for complex recursive queries on distributed data platforms, KDDLib to build knowledge discovery tasks, and advanced interfaces to help users port new knowledge discovery models. Following KDDLog, we present SEIZE, a unified framework that enables dataflow inspection: wiretapping the data path of data analytics applications with listening logic. We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. Finally, we propose RIOS, a runtime integrated query optimizer for data analytics that lazily binds to execution plans at runtime, after collecting the statistics needed to make better decisions. A specific focus in our design is to obtain accurate estimates of predicate (including UDF) selectivities for determining an optimal join order and physical join implementation, without incurring significant runtime overheads.


Thesis
Peer Reviewed

Summarizing Massive Information for Querying Web Sources and Data Streams

Mousavi, Hamid
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2014)

Largely as a result of advances brought by the Web and related technologies, we are now experiencing a tremendous growth in the volume of data streaming between, and stored at, many nodes of the Internet. This "Big Data" revolution is underscoring the importance of summarization in general, and in particular in two new application areas that are rich in practical significance and interesting research challenges. Indeed, while summarization techniques, including sampling, histograms, and quantiles, remain critical in analyzing large data sets and optimizing queries in traditional databases, new techniques are needed to address the following two problems. The first is that, in addition to summarization techniques for stored data, we now need online/continuous summaries for the streaming data, e.g., real-time online histograms. When dealing with massive data streams and fast-changing distributions, summaries should be quickly updated with the newly arrived data, in order to reflect the most recent portion (window) of the data stream. The second problem is that the Web is storing large corpora of structured, semi-structured, and unstructured (free-text) documents, and these documents are subject to the ambiguities of natural language and the challenges they pose to machine processing. This situation has so far severely limited the ability of smart applications to use the information contained in Web pages, as needed to realize the Semantic Web vision. It is however clear that many of these limitations can be overcome, and advanced search and analysis applications can be supported, if the knowledge of each Web page can be summarized into a standard machine-friendly structure. In this dissertation, we attack these two difficult problems by proposing fast summarization techniques for (i) scalar information of data streams and (ii) textual information in Web pages.

For scalar data, we present light and fast synopses, namely histograms, combined with various sampling approaches in order to implement more practical summarization techniques over massive data sets and data streams. To our knowledge, this technique provides the most accurate online histograms for data streams with sliding windows. For textual documents, we introduce several techniques and systems for extracting structured summaries from unstructured text and use these structured summaries to complete the existing ones as well as to improve their consistency.
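
To make the notion of an online histogram over a sliding window concrete, the following Python sketch keeps an exact equi-width histogram of the most recent items of a stream. It is only a baseline under stated assumptions: the thesis combines histograms with sampling to approximate such summaries, whereas this sketch stores every item in the window, and the class name `SlidingWindowHistogram` and its parameters are hypothetical.

```python
from collections import deque


class SlidingWindowHistogram:
    """Exact equi-width histogram over the last window_size stream items."""

    def __init__(self, lo, hi, num_buckets, window_size):
        self.lo = lo
        self.width = (hi - lo) / num_buckets
        self.num_buckets = num_buckets
        self.window_size = window_size
        self.counts = [0] * num_buckets
        self.window = deque()  # bucket index of each item still in the window

    def _bucket(self, x):
        # Map x to a bucket, clamping out-of-range values to the end buckets.
        idx = int((x - self.lo) / self.width)
        return min(max(idx, 0), self.num_buckets - 1)

    def add(self, x):
        """Insert a new item and expire the oldest one if the window is full."""
        b = self._bucket(x)
        self.window.append(b)
        self.counts[b] += 1
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1


hist = SlidingWindowHistogram(lo=0, hi=10, num_buckets=5, window_size=3)
for value in [1, 3, 5, 9]:
    hist.add(value)
```

After the fourth insertion the value 1 has left the window, so only the buckets for 3, 5, and 9 are populated. Each update takes constant time, but memory grows with the window size, which is the cost that sampling-based synopses are meant to avoid.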


Thesis
Peer Reviewed

Multi-relational Representation Learning and Knowledge Acquisition

Chen, Muhao
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2019)

Multi-relational representation learning methods encode entities or concepts of a knowledge graph in a continuous and low-dimensional vector space, where the relational inferences of entities (concepts) are modeled as simple vector algebra. Despite such knowledge representations being crucial to a wide range of knowledge-driven applications, state-of-the-art methods are limited to learning embeddings for simple relation facts in a single knowledge graph. In this dissertation, we pursue the goal of comprehensively capturing the multifaceted relational knowledge in various types of knowledge bases, and towards that we contribute on three fronts: (i) we introduce the first multi-relational representation learning framework that learns to transfer embeddings across multiple knowledge bases; (ii) we propose techniques for preserving relational facts with complex properties in the embedding space, including those that enforce relational properties, form hierarchies, or are endowed with uncertainty; (iii) we investigate large-scale relational learning based on other modalities of data, with the aim of acquiring knowledge to enrich the knowledge bases.

Each of these three research problems presents a series of key challenges which we address. Thus, for transferred embeddings, we develop joint learning of relational structure encoders that confront the heterogeneity of contents in knowledge graphs, together with diverse types of alignment models that learn to transfer on the basis of simple, hierarchical or fuzzy alignment information. In addition, we extend the joint learning framework with semi-supervised co-training of entity descriptions, and proactive score propagation for fuzzy alignment, so as to handle scenarios where only limited alignment information is provided. To capture complex relation facts, we focus first on the relational properties that cause non-linearity in embedding structures, for which we leverage non-linear component-specific mappings of embeddings to eliminate the conflicts, and strengthen the learning process with hierarchical regularization. For uncertain relation facts, we preserve the uncertainty by utilizing Probabilistic Soft Logic to guide the non-linear regressor that is jointly trained with the structure encoder. We further study the support of relational learning based on sequence data. We propose generic neural sequence pair models to support large-scale relation detection, in which we incorporate different sequence encoders for heterogeneous data such as structured articles, amino acid sequences, and lexicographic knowledge.

The methods proposed in this dissertation extend the application of multi-relational embeddings, and improve a wide spectrum of applications in different domains. These include knowledge alignment, monolingual and cross-lingual knowledge graph completion, semantic search, entity typing, paraphrase identification, uncertain relation prediction, protein-protein interaction prediction, protein binding affinity estimation, single-cell RNA-sequence imputation, and Web-scale sub-article matching.


Thesis
Peer Reviewed

Power, Performance and Scalability for Big Data Query Languages: The Machine Learning Challenge

Wang, Jin
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2020)

In the Big Data era, there is a resurgence of interest in using Datalog to express data analysis applications that require recursive computations. However, the use of non-monotonic aggregates in recursion raises difficult semantic issues. Recent theoretical advances like monotonic aggregation and Pre-Mapability (PreM) provide the formal semantics for the usage of aggregates in recursive Datalog rules enabling the expression of a wide spectrum of advanced analytical tasks, such as graph analysis, data mining, machine learning and stream processing. In this dissertation, we explore opportunities and issues created by these advances, including the expressiveness of Datalog in advanced applications and their optimization to achieve superior performance and scalability.

Firstly, we find that Datalog serves as an efficient query language that simplifies the writing of machine learning applications and provides a unified environment for their development and deployment on multiple platforms. Following this route, we propose a declarative machine learning framework of tested effectiveness on top of Apache Spark. We present an in-depth theoretical analysis that shows how key ML algorithms can be expressed and efficiently implemented by recursive Datalog programs that use aggregates in recursion, thereby achieving both formal and efficient operational semantics. We also present the compilation and optimization techniques we developed to support the complex recursive queries required by ML applications in distributed shared-nothing architectures. Next, we share some theoretical results to show that programs computing any aggregates on sets of facts of predictable cardinality are equivalent to stratified programs where the pre-computation of the cardinality of the set is followed by a stratum where recursive rules only use monotonic constructs. Finally, we investigate how to improve the parallelism of semi-naive evaluation of recursive Datalog programs on shared-memory multi-core machines, and discuss the prototype system we have developed and the high performance levels it delivers.
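
Semi-naive evaluation, which this abstract refers to, can be sketched on the textbook transitive-closure program `tc(X,Y) <- edge(X,Y).` and `tc(X,Z) <- tc(X,Y), edge(Y,Z).` The Python sketch below is a single-threaded illustration only, with no aggregates and no parallelism; it is not the prototype system described in the thesis, and the function name `transitive_closure` is hypothetical. The key idea is that each iteration joins only the facts derived in the previous iteration (the delta) with `edge`, instead of rejoining every known fact.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of the recursive rule
    tc(X, Z) <- tc(X, Y), edge(Y, Z), seeded by tc(X, Y) <- edge(X, Y)."""
    edges = set(edges)
    successors = defaultdict(set)  # adjacency index on edge(Y, Z), keyed by Y
    for src, dst in edges:
        successors[src].add(dst)

    total = set(edges)  # all tc facts derived so far
    delta = set(edges)  # tc facts that were new in the previous iteration
    while delta:
        # Join only the newest facts with edge, not the whole tc relation.
        derived = {
            (x, z)
            for (x, y) in delta
            for z in successors.get(y, ())
        }
        delta = derived - total  # keep only facts not seen before
        total |= delta
    return total


chain = transitive_closure([(1, 2), (2, 3), (3, 4)])
cycle = transitive_closure([(1, 2), (2, 1)])
```

The loop terminates at the fixpoint, when an iteration derives no new facts. This also holds on cyclic graphs, since already known facts are removed from the delta.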


Thesis
Peer Reviewed

Wikipedia Infobox Temporal RDF Knowledge Base and Indices

Song, Aige
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2015)

As the real world evolves, Infoboxes for Wikipedia subjects are updated to reflect the corresponding changes, and there is growing interest in the evolution history of subjects in Wikipedia. Thus, the management of historical information and the efficiency of queries over this temporal information have become major concerns.

In this paper, we introduce the Wikipedia Infobox temporal RDF knowledge base, which is constructed from the Wikipedia Infobox history dump, and evaluate the efficiency of temporal queries over this knowledge base. Specifically, we evaluate temporal selection and temporal join queries on different database systems with different indices, including MySQL B+ Tree, PostgreSQL B-Tree, and Interval Tree.


Thesis
Peer Reviewed

Declarative Languages and Scalable Systems for Graph Analytics and Knowledge Discovery

Yang, Mohan
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2017)

The growing importance of data science applications has motivated great research interest in powerful languages and scalable systems for supporting advanced analytics on massive data sets. Languages such as R and Scala are used to develop advanced analytical applications that are not supported by SQL, the traditional query language used for decades to search the database and analyze its data. An interesting research question that arises in this scenario is whether it is possible to design an efficient query language that simplifies the writing of advanced analytical applications and provides a unified environment for their development and deployment on multiple platforms, including massively parallel ones. In this thesis, we provide a positive answer to this question by demonstrating extensions of the logic-based query language Datalog and their implementation techniques to enable (i) scalable support for graph analytics and knowledge discovery applications, and (ii) portability between multicore machines and clusters.

A first set of extensions discussed in this thesis is based on monotonic aggregates and led to the implementation of our Deductive Application Language (DeAL) system which (i) achieves superior performance for graph analytics applications compared with other Datalog systems on multicore machines, and (ii) outperforms other distributed Datalog systems, as well as both GraphX and native Apache Spark. We then tackle the difficult problem of supporting knowledge discovery applications, by introducing non-monotonic extensions to support generic user-defined aggregates, for which we provide a formal logic-based semantics. The Knowledge Discovery in Datalog (KDDlog) language so derived can express efficiently both descriptive analytics, such as rollups and data cubes, and predictive analytics, such as association rule mining, classification, regression analysis, and cluster analysis.


Thesis
Peer Reviewed

Approximation and Search Optimization on Massive Data Bases and Data Streams

Zeng, Kai
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2014)

A fast response is critical in many data-intensive applications, including knowledge discovery analytics on big data, and queries searching for complex patterns in sequences, data streams and graphs. Moreover, the volume of data and the complexity of the analytical tasks they must support are now growing at such a torrid rate that the vigorous progress in performance and scalability of computer systems cannot keep up with it. This situation calls for (i) effective optimization techniques to reduce the cost of complex pattern queries, and (ii) approximation techniques to produce results of predictable accuracy using a small subset of the data. In this dissertation we (i) introduce new query languages and optimization techniques for pattern matching in sequences, data streams and graphs, and (ii) formulate a general approximation model for analytics queries.

Thus, in this dissertation we have made the following contributions:

(i) We have designed and demonstrated optimized implementation techniques for K*SQL and XSeq, which provide a unified framework for complex pattern searching on relational and XML DBs, respectively. In particular, we introduced efficient execution exploiting recent advances in automata theory known as Nested Words.

(ii) We have designed and demonstrated an efficient, scalable graph search engine based on a novel distributed memory-based system architecture, and exploited graph exploration operations to implement an efficient graph search algorithm.

(iii) We have introduced support for bootstrap methods in MapReduce. Bootstrap is a very useful estimation technique for sampling-based approximation. Thus we designed the EARL of Hadoop system, which facilitates and optimizes the use of bootstrap methods on parallel MapReduce systems.

(iv) We have then invented and demonstrated an analytical model for bootstrap, whereby the Monte-Carlo evaluation of the standard method is replaced by a probabilistic query. Thus, we provided a semiring-based extension of relational algebra and related query optimization techniques to support fast execution of the resulting probabilistic query. We finally developed an Analytical Bootstrap System (ABS) for parallel and distributed computing platforms. ABS is applicable to most relational database queries and delivers very accurate estimates at speeds that outperform the traditional bootstrap method by orders of magnitude.
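
For readers unfamiliar with the standard method that this abstract contrasts with, the following Python sketch shows the traditional Monte-Carlo bootstrap: resample with replacement many times, recompute the estimator on each resample, and use the spread of those estimates as an error measure. It is a single-machine illustration only, not EARL or ABS; the function name `bootstrap_standard_error` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_standard_error(sample, estimator=statistics.mean,
                             num_resamples=1000, seed=0):
    """Estimate the standard error of estimator(sample) by Monte-Carlo bootstrap."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        # Draw n items with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # The spread of the resampled estimates approximates the standard error.
    return statistics.stdev(estimates)


data = list(range(1, 101))
standard_error = bootstrap_standard_error(data)
```

Every resample re-runs the estimator over n items, so the cost grows with the number of resamples. This repeated evaluation is the part that the analytical model in contribution (iv) replaces with a single probabilistic query.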


Thesis
Peer Reviewed

Query Language Extensions for Advanced Analytics on Big Data and their Efficient Implementation

Gu, Jiaqi
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2019)

Advanced analytics and other Big Data applications call for query languages that can express the complex logic of advanced analytics, and are also amenable to efficient implementations providing high throughput and low latency. Existing systems such as Hadoop or Spark can now handle large amounts of data via MapReduce-enabled parallelism, but they lack simple query languages that can declaratively express applications such as common graph and data mining algorithms, and the search for complex patterns in massive data sets. Fortunately, recent advances in recursive query languages and automata theory have paved the way for extending widely used declarative query languages, such as SQL, to address these problems. Thus, in this dissertation, we propose two significant new extensions to the current SQL standards and demonstrate their efficient implementations. We first propose the Recursive-aggregate-SQL language, RaSQL. RaSQL queries have a declarative formal fixpoint semantics that is guaranteed by the PreM property, while being amenable to efficient recursive query evaluation techniques based on the Semi-Naive optimization for the fixpoint computation. RaSQL is implemented on top of Apache Spark, achieving superior scalability and performance compared to state-of-the-art systems such as Apache Giraph, GraphX and Myria. Then, we propose a new Weighted Search Pattern language, WSP, which extends the SQL-TS language. WSP is able to provide semantic rankings of the query results, and its implementation and optimization are guided by the theory of weighted automata.


Thesis
Peer Reviewed

Integration, Provenance, and Temporal Queries for Large-Scale Knowledge Bases

Gao, Shi
Advisor(s): Zaniolo, Carlo

UCLA Electronic Theses and Dissertations (2016)

Knowledge bases that summarize web information in RDF triples deliver many benefits, including support for natural language question answering and powerful structured queries that extract encyclopedic knowledge via SPARQL. Large-scale knowledge bases grow rapidly in terms of scale and significance, and undergo frequent changes in both schema and content. Two critical problems have thus emerged: (i) how to support temporal queries that explore the history of knowledge bases or flash back to the past; (ii) how to integrate knowledge from different sources and improve the quality of the integrated knowledge base while preserving the provenance information. In this dissertation, we propose a framework that supports knowledge integration, temporal query evaluation and user-friendly interfaces for large-scale knowledge bases. Towards this goal, we make the following contributions:

(i) We propose SPARQLT, a temporal extension of the structured query language SPARQL based on a point temporal model, which simplifies the expression of temporal joins and eliminates the need for temporal coalescing. This approach makes possible an end-user interface, HKB (Historical Knowledge Browser), where users can browse the evolution history of knowledge bases and express historical queries via simple by-example conditions in the infoboxes of Wikipedia pages.

(ii) We have designed and implemented RDF-TX (RDF Temporal eXpress), an efficient system for managing temporal RDF data and evaluating SPARQLT queries. RDF-TX takes advantage of compressed Multiversion B+ trees to achieve fast evaluation of temporal queries. Experimental results demonstrate that our indexing and query optimization techniques deliver superior performance over other systems.

(iii) We propose a framework for knowledge extraction and integration. We first introduce IBMiner, a novel NLP-based system that derives knowledge bases from free text and preserves the provenance of extracted triples. IBMiner uses a deep NLP-based approach to extract subject-attribute-value triples from free text, and maps the attributes to those introduced in existing knowledge bases. Then we integrate public knowledge bases with the knowledge base generated by IBMiner into one of superior quality and coverage, called IKBStore. User-friendly interfaces are provided to manage the knowledge in IKBStore while maintaining provenance information.
