Search

Scholarly Works (4 results)

Sort By:

Thesis
Peer Reviewed

Big Graph Analytics on Just A Single PC

Wang, Kai
Advisor(s): Xu, Harry Guoqing

UCLA Electronic Theses and Dissertations (2019)

As graph data becomes ubiquitous in modern computing, developing systems to efficiently process large graphs has gained increasing popularity. There are two major types of analytical problems over large graphs: graph computation and graph mining. Graph computation includes a set of problems that can be represented through liner algebra over an adjacency matrix based representation of the graph. Graph mining aims to discover complex structural patterns of a graph, for example, finding relationship patterns in social media network, detecting link spam in web data.

Due to their importance in machine learning, web application and social media, graph analytical problems have been extensively studied in the past decade. Practical solutions have been implemented in a wide variety of graph analytical systems. However, most of the existing systems for graph analytics are distributed frameworks, which suffer from one or more of the following drawbacks: (1) many of the (current and future) users performing graph analytics will be domain experts with limited computer science background. They are faced with the challenge of managing a cluster, which involves tasks such as data partitioning and fault tolerance they are not familiar with; (2) not all users have access to enterprise cluster in their daily development tasks; (3) distributed graph systems commonly suffer from large startup and communication overhead; and (4) load balancing in a distributed system is another major challenge. Some graph algorithms have dynamic working sets and and it is thus hard to distribute the workload appropriately before the execution.

In this dissertation, we identify three categories of graph workloads for which single-machine systems are more suitable than distributed systems: (1) analytical queries that do not need exact answers; (2) program analysis tasks that are widely used to find bugs in real-world software; and (3) graph mining algorithms that are important for many information-retrieval tasks.

Based on these observations, we have developed a set of single-machine graph systems to deliver efficiency and scalability specifically for these workloads. In particular, this dissertation makes the following contributions. The first contribution is the design and implementation of a single-machine graph query system named GraphQ, which divides a large graph into partitions and merges them with the guidance from an abstraction graph. By using multiple levels of abstraction, it can quickly rule out infeasible solutions and identify mergeable partitions. GraphQ uses the memory capacity as a budget and tries its best to find solutions before exhausting the memory, making it possible to answer analytical queries over very large graphs with resources affordable to a single PC. The second contribution is the design and implementation of Graspan, a single-machine, disk-based graph processing system tailored for interprocedural static analyses. Given a program graph and a grammar specification of an analysis, Graspan uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. With the help of novel graph processing techniques, we turn sophisticated code analyses into scalable Big Graph analytics. The third contribution of this dissertation is a single-machine, out-of-core graph mining system, called RStream, which leverages disk support to support efficient edge streaming for mining very large graphs. RStream employs a rich programming model that exposes relational algebra for developers to express a wide variety of mining tasks and implements a runtime engine that delivers efficiency with tuple streaming.

In conclusion, this dissertation attempts to explore the opportunities of building single-machine graph systems for scenarios where distributed systems do not work well. Our experimental results demonstrate that the techniques proposed in this dissertation can efficiently solve big graph analytical problems on a single consumer PC. We hope that these promising results will encourage future work to continue building affordable single-machine systems for a rich set of datasets and analytical tasks.

Cover page: Big Graph Analytics on Just A Single PC

Thesis
Peer Reviewed

Making Video Analytics Applications Efficient and Affordable

UCLA Electronic Theses and Dissertations (2022)

While using machine learning to analyze video data is seeing explosive growth, modern vision models are difficult and expensive to deploy in practice. This is because while models are getting more accurate and robust, they are also getting more complicated and thus more resource-intensive. At the same time, the environments in which they are used, such as self-driving cars, demand extremely fast and accurate results.

Traditionally, all video data was sent to cloud servers, where models were run over the frames on GPU machines. Recently though, the use of edge computing has shown promise in addressing this tension between performance and resource usage. Resources available at the edge are highly heterogeneous in terms of computational power and memory, and while most prior work assumes a well-equipped edge, we find that the devices used in practice are often inexpensive commodity hardware. This limits the amount of computation that can practically happen at the edge.

In this thesis, we aim to make the most of these resource-constrained edge devices. We present two systems that improve the tradeoff between performance and resource usage in live video analysis. Our first system, Reducto, uses the limited amount of compute available on smart cameras to run cheap computer vision techniques and filter out frames that are similar enough to the previous frame that we can reuse the previously computed result as an approximation. This lowers GPU usage by over 50% and doubles processing speed. Our next system, GEMEL, addresses the memory bottleneck of running many models on an edge server by finding and merging common layers across a diverse set of models. This lowers the memory footprint by up to 60% and improves accuracy by up to 39%.

Cover page: Making Video Analytics Applications Efficient and Affordable

Thesis
Peer Reviewed

Efficient, Affordable, and Scalable Deep Learning Systems

Thorpe, John Vincent
Advisor(s): Xu, Harry Guoqing

UCLA Electronic Theses and Dissertations (2022)

Deep Learning has become one of the most important tools in computer science in the last decade because of its ability to perform tasks that involve complex reasoning. Researchers have been pushing forward the state of the art in many ways, but we focus on two major trends that have shown promise. The first is creating models which can change based on the inputs to better learn patterns in the data. For example, with Graph Neural Networks, training the same model over two different graphs will result in models which have been specialized to the structure of the particular graph. This specialization results in the same model architectures having very different computation patterns depending on the input graph. The second is the rapidly increasing size of models, measured by the number of parameters. This increase leads to much more general and accurate machine learning agents, but requires a large amount of hardware to train efficiently. Because of these complexities, these trends result in state of the art models requiring massive amounts of computational and financial resources, limiting them to larger well-funded companies that have the funds to experiment.

Lowering the financial barrier to entry for experimentation with these models will allow many more small businesses and research labs to experiment with them. The current paradigm of machine learning focuses on throwing high-end computational power at the Deep Learning problem as the resource requirements grow. However, a key observation for addressing this problem lies in the increasingly heterogenous offering made by cloud providers. Cloud platforms offer a much more diverse set of resources than would be available to most on-site clusters allowing for a more fine grained approach to the problem. By combining the heterogeneous offers of cloud providers with in-depth knowledge of the compute profile of deep learning models, we can achieve better value compared to most existing systems. Here, value is defined as the performance per dollar we can achieve.

My first system, Dorylus, a Graph Neural Network framework, provides up to 3.86x more value than its GPU based variant running on large graphs by employing serverless threads to do asynchronous computation. My next work, Bamboo, focuses on reducing the cost of training massive pipeline-parallel models by making preemptible instances resilient and performant by intelligently introducing computation redundancy into the pipelines. Bamboo was able to provide almost 2x the performance-per-dollar of training with full-priced non-preemptible instances, and provide 1.5x the performance-per-dollar of existing systems designed to run on spot instances.

Cover page: Efficient, Affordable, and Scalable Deep Learning Systems

Thesis
Peer Reviewed

Reinventing Datacenter System Stacks for Resource Harvesting

UCLA Electronic Theses and Dissertations (2024)

The rise of cloud computing and recent AI breakthroughs have radically expanded the demand for datacenter hardware resources, including CPU, memory, and accelerators such as GPUs. Despite the critical need to improve resource utilization and reduce operational cost, current datacenter system stacks—comprising OSes and runtime systems—struggle to fully utilize hardware resources due to high load variability and stringent performance requirements of datacenter workloads, leading to substantial waste of compute and memory resources.

This dissertation demonstrates that it is feasible to safely and efficiently harvest stranded datacenter resources, even when they are intermittently available and dispersed across servers. Specifically, we identify two previously overlooked resource harvesting opportunities in today's data center system stacks. First, although datacenter applications often have varying and potentially large resource demands, they typically include elastic components that can be safely discarded under resource pressure, making them ideal for utilizing idle resources with temporal availability. Existing operating systems and runtime systems, though, lack proper interfaces for applications to convey such semantics and take advantage of idle resources. Second, while the availability of resources per server is unpredictable, combining stranded resources across servers can offer better overall availability. However, this opportunity is unavailable to many datacenter workloads that were designed for running on a single machine.

Driven by these insights, this dissertation rethinks the datacenter system stack and introduces holistic designs for OS abstractions, the OS kernel, and application runtime systems for resource harvesting. The contributions of this dissertation are fourfold.

First, we investigate how to harvest resources, especially memory which is inelastic and hard to re-assign between applications, within a single server. We introduce Midas, an OS memory abstraction that allows applications to use idle memory for storing their soft state. Midas efficiently manages soft memory with a kernel-runtime co-design, achieving near-optimal performance for four real-world datacenter applications and responding to extreme memory pressure quickly enough to avoid running out of memory.

Second, we explore how to harvest resources across servers. We present Hermit, a redesigned OS kernel paging/swap system that enables applications to harvest idle memory on remote servers with full transparency and efficiency. Hermit allows any application to harness remote memory without changing a single line of code, making it practical for legacy real-world datacenter applications. It also achieves three orders of magnitude lower tail latency and up to 1.87 times higher throughput for latency-critical and batch-processing applications, respectively.

Third, built atop Hermit, Canvas is a resource isolation mechanism for the kernel swap system that allows multiple applications to share remote memory without performance interference. By segregating resource usages and access patterns of co-running applications, Canvas further adaptively optimizes kernel swap for each application. Our evaluation and performance study with a wide range of datacenter applications demonstrate that Canvas reduces performance variation by a factor of 7 and improves their throughput by an average of 3.5 times when multiple applications share remote memory.

Finally, we demonstrate that our insights can be generalized to accelerators and emerging AI workloads. We develop Concerto, a preemptive GPU runtime for large language model serving that harnesses idle GPU resources for offline inference tasks. By opportunistically batching offline inference tasks when online serving cannot fully saturate GPUs, Concerto significantly increases GPU utilization by an average of 2.35 times. By reactively preempting offline tasks upon online load bursts, Concerto reduces online serving latency by two orders of magnitude.

Together, these systems form a new datacenter system stack that synergistically enhances performance, resource utilization, and cost efficiency, offering a transformative approach to modern datacenter management.

Cover page: Reinventing Datacenter System Stacks for Resource Harvesting