Skip to main content
Open Access Publications from the University of California
Cover page of Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Preble: Efficient Distributed Prompt Scheduling for LLM Serving


Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices are to include domain-specific instructions, illustration of tool usages, and long context such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today's LLM serving systems treat every request in isolation, missing the opportunity of computation reuse.

This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We perform a study on five popular LLM workloads. Based on our study results, we designed a distributed scheduling system that co-optimizes computation reuse and load balancing. Our evaluation of Preble on two to 8 GPUs with real workloads and request arrival patterns on two open-source LLM models shows that Preble outperforms the state of the art avg latency by 1.5x to 14.5x and p99 by 2x to 10x.

Cover page of Analysis of Targeted Advertising in Snapchat Political Ads

Analysis of Targeted Advertising in Snapchat Political Ads


Snapchat is one of the most popular social media apps in the world. It is no surprise, then, that many political ads are run on the service each year. Snap Inc.'s political ads library is part of an effort by the company to increase transparency in their advertising practices. The data analyzed in this project spans 2019-2020, and consists of information on every political ad that was run on the service in that timeframe, including who the ad buyer was, how much the ad cost, what areas it targeted, etc. Geographic and monetary distribution of ads is analyzed and possible explanations given for anomalies. Missingness of the data was evaluated and Vermont was identified as an area with unusual spending. With α = 0.05 the null hypothesis was rejected (p = 0.02); the distribution of ad dollars to Vermont is not wholly random.

Cover page of Development of Algorithm to Predict Political Ad Spending on Snapchat

Development of Algorithm to Predict Political Ad Spending on Snapchat


The Snapchat ads dataset contains political ad data for ads on Snapchat, oneof the largest social media networks in the world. A key feature of the datasetis how much money an organization spends on a particular ad, found in the`Spend` column. It is reasonable to assume that this amount varies based oncertain factors, but can we use those factors to figure out how much is spenton an ad? We can explore this by predicting ad spending through machinelearning. After feature analysis and engineering, we arrive at a linearregression model with $R^2$ = .85 and perform a fairness evaluation of thealgorithm.

Cover page of Estimating Profitability of Alternative Crypto-currencies

Estimating Profitability of Alternative Crypto-currencies


Digital currencies have flourished in recent years, buoyed by the tremendous success of Bitcoin. These blockchain-based currencies, called altcoins, have attracted enthusiasts who enter the market by mining or buying them. To mine or to buy, however, can be a difficult decision; each altcoin is different from another, and the market tends to be volatile. In this work, we analyze the profitability of mining and speculation for 36 altcoins using real-world blockchain and trade data. Using opportunity cost as a metric, we estimate the mining cost for a coin with respect to a more popular coin. For every dollar invested in mining or buying a coin, we also estimate the revenue under various conditions, such as time of market entry and hold positions. While some coins offer the potential for spectacular returns, many follow a simple bubble-and-crash scenario, which highlights the extreme risks---and potential gains---in altcoin markets.

Pre-2018 CSE ID: CS2017-1019

Cover page of Hardening the NOVA File System

Hardening the NOVA File System


Emerging fast, persistent memories will enable systems that combine conventional DRAM with large amounts of non-volatile main memory (NVMM) and provide huge increases in storage performance. Fully realizing this potential requires fundamental changes in how system software manages, protects, and provides access to data that resides in NVMM. We address these needs by describing a NVMM-optimized file system called NOVA that is both fast and resilient in the face of corruption due to media errors and software bugs. We identify and propose solutions for the unique challenges in hardening an NVMM file system, adapt state-of-the-art reliability techniques to an NVMM file system, and quantify the performance and storage overheads of these techniques. We find that NOVA's reliability features increase file system size system size by 14.9% and reduce application-level performance by between 2% and 38%.

Pre-2018 CSE ID: CS2017-1018

Cover page of Echidna: Programmable Schematics to Simplify PCB Design

Echidna: Programmable Schematics to Simplify PCB Design


In this paper we introduce Echidna, a hybrid schematic/ text-based language for describing PCB circuit schematics. Echidna allows designers to use high-level programming con- structs to describe schematics, supports modular, reusable design components with well-defined interfaces, and provides for complex parameterization of those modules. Echidna deeply integrates a high-level programming language into a schematic-based design flow. The designer can describe schematics in code, as a schematic, or as a seamless combination of the two. We demonstrate its usefulness with several case studies.

Pre-2018 CSE ID: CS2016-1017

Cover page of ASIC Clouds: Specializing the Datacenter

ASIC Clouds: Specializing the Datacenter


GPU and FPGA-based clouds have already demonstrated the promise of accelerating computing-intensive workloads with greatly improved power and performance. In this paper, we examine the design of ASIC Clouds, which are purpose-built datacenters comprised of large arrays of ASIC accelerators, whose purpose is to optimize the total cost of ownership (TCO) of large, high-volume chronic computations, which are becoming increasingly common as more and more services are built around the Cloud model. On the surface, the creation of ASIC clouds may seem highly improbable due to high NREs and the inflexibility of ASICs. Surprisingly, however, large-scale ASIC Clouds have already been deployed by a large number of commercial entities, to implement the distributed Bitcoin cryptocurrency system. We begin with a case study of Bitcoin mining ASIC Clouds, which are perhaps the largest ASIC Clouds to date. From there, we design three more ASIC Clouds, including a YouTube-style video transcoding ASIC Cloud, a Litecoin ASIC Cloud, and a Convolutional Neural Network ASIC Cloud and show 2-3 orders of magnitude better TCO versus CPU and GPU. Among our contributions, we present a methodology that given an accelerator design, derives Pareto-optimal \AC{} Servers, by extracting data from place-and-routed circuits and computational fluid dynamic simulations, and then employing clever but brute-force search to find the best jointly-optimized ASIC, DRAM subsystem, motherboard, power delivery system, cooling system, operating voltage, and case design. Moreover, we show how data center parameters determine which of the many Pareto-optimal points is TCO-optimal. Finally we examine when it makes sense to build an ASIC Cloud, and examine the impact of ASIC NRE.

Pre-2018 CSE ID: CS2016-1016

Cover page of Gullfoss: Accelerating and Simplifying Data Movement among Heterogeneous Computing and Storage Resources

Gullfoss: Accelerating and Simplifying Data Movement among Heterogeneous Computing and Storage Resources


High-end computer systems increasingly rely on heterogeneous computing resources. For instance, a datacenter server might include multiple CPUs, high-end GPUs, PCIe SSDs, and high-speed networking interface cards. All of these components provide computing resources and operate at a high bandwidth. Coordinating the movement of data and scheduling computation across these resources is a complex task, as current programming models require system developers to explicitly schedule data transfers. Moving data is also inefficient in terms of both performance and energy costs: some applications running on GPU-equipped systems spend over 55% of their execution time and 53% of energy moving data between the storage device and the GPU. This paper proposes Gullfoss, a system that provides a simplified programming model for these heterogeneous computing systems. Gullfoss provides a high-level interface for specifying an application’s data movement requirements, and dynamically schedules data transfers while accounting for current system load and program requirements. Our initial implementation of Gullfoss focuses on data transfers between an SSD and a GPU, eliminating wasteful transfers to and from main memory as data moves between the two. This saves memory energy and bandwidth, leaving the CPU free to do useful work or operate at a lower frequency to improve energy efficiency. We implement and evaluate Gullfoss using commercially available hardware components. Gullfoss achieves 1.46× speedup, reduces energy consumption by 28%, and improves energy-delay product by 41%, comparing with systems without Gullfoss. For multi-program workloads, Gullfoss shows 1.5× speedup. Gullfoss also improves the performance of a GPU-based MapReduce framework by 10%.

Pre-2018 CSE ID: CS2015-1015

Cover page of Experience in Building a Comparative Performance Analysis Engine for a Commercial System

Experience in Building a Comparative Performance Analysis Engine for a Commercial System


Performance testing is a standard practice for evolving systems to detect performance issues proactively. It samples various performance metrics that will be compared with a stable baseline to judge whether the measurement data is abnormal. This type of comparative analysis requires domain expertise, which can take experienced performance analysts days to conduct. In an effort to build an automatic solution for a leading data warehousing company to improve the comparative performance analysis efficiency, we implemented machine learning approaches proposed by existing research. But the initial result has a 86% false negative rate on average, which means the majority of performance defects would be missed. To investigate causes for this unsatisfying result, we take a step back to revisit the performance data itself and find several important data related issues that are overlooked by existing work. In this paper, we discuss in detail these issues and share our hindsights to address them. With the new learning scheme we devise, we are able to reduce the false negative rate to as low as 16% and achieve a balanced accuracy of 0.91, which enables the analysis engine to be practically adopted.

Pre-2018 CSE ID: CS2015-1014

Cover page of Sorting 100 TB on Google Compute Engine

Sorting 100 TB on Google Compute Engine


Google Compute Engine offers a high-performance, cost-effective means for running I/O-intensive applications. This report details our experience running large-scale, high- performance sorting jobs on GCE. We run sort applications up to 100 TB in size on clusters of up to 299 VMs, and find that we are able to sort data at or near the hardware capabilities of the locally attached SSDs. In particular, we sort 100 TB on 296 VMs in 915 seconds at a cost of $154.78. We compare this result to our previous sorting experience on Amazon Elastic Compute Cloud and find that Google Compute Engine can deliver similar levels of performance. Although individual EC2 VMs have higher levels of performance than GCE VMs, permitting significantly smaller cluster sizes on EC2, we find that the total dollar cost that the user pays on GCE is 48% less than the cost of running on EC2.

Pre-2018 CSE ID: CS2015-1013