In today’s data-centric world, most of the global data resides in the cloud, accessible through various online services. Whether it’s browsing medical advice on WebMD, referencing information on Wikipedia, managing investments through online stock brokers, or streaming content from platforms like Netflix and YouTube, our online interactions are vast. Ensuring privacy in these contexts is crucial because the knowledge of a user’s accessed content could potentially reveal sensitive private information about the user. However, existing privacy measures employed by online services often fall short, leading to incidents where providers misuse and share users’ sensitive data with external parties. Moreover, powerful entities, including nation-states and government agencies, frequently engage in mass surveillance by collecting such data. The only way to defend against such powerful privacy risks is to design these services such that the service provider is oblivious to our data. Although recent cryptography research offers techniques for such systems, their adoption in mass Internet applications remains limited due to scalability issues. As a result, a significant tension has prevailed between privacy and scalability for online services over the past few decades. This dissertation centers on the following fundamental question: can we break this tension and build data systems that guarantee strong privacy to the users while ensuring both real-world scalability and practical performance?
This dissertation explores the tension between privacy and scalability in three application areas— voice communication, accessing Wikipedia articles, and retrieving content from a cloud service provider’s key-value database. Through the development of three new systems — Addra, Coeus, and Pantheon — it substantially improves the tension in these application areas. These systems are designed on the fundamental principle of composing cryptographic building blocks in a manner that efficiently leverages their diverse strengths and weaknesses. Consequently, they represent a notable advancement in this field.
Addra is a system that facilitates voice communication while hiding the associated metadata, such as the knowledge of who is communicating with whom, as well as the time and duration of the call. This metadata contains rich information about people’s lives and therefore is a prime target for powerful adversaries such as nation states. Existing systems that hide voice call metadata either require trusted intermediaries in the network or scale to only tens of users. Addra is the first system for voice communication that hides metadata over fully untrusted infrastructure and scales to tens of thousands of users. At a high level, Addra follows a template in which callers and callees deposit and retrieve messages from private mailboxes hosted at an untrusted server. However, Addra improves message latency in this architecture, which is a key performance metric for voice calls. First, it enables a caller to push a message to a callee in two hops, using a new way of assigning mailboxes to users that resembles how a post office assigns PO boxes to its customers. Second, it innovates on the underlying cryptographic machinery and constructs a new private information retrieval scheme, FastPIR, that reduces the time to process oblivious access requests for mailboxes. An evaluation of Addra on a cluster of 80 machines on AWS demonstrates that it can serve 32K users with a 99-th percentile message latency of 726 ms—a 7× improvement over a prior system for text messaging in the same threat model.
Coeus addresses a fundamental abstract problem termed as oblivious document ranking and retrieval. The problem is stated as follows: given a confidential string q and a remote server containing a collection of public documents D, how can one effectively select and access one of the top K most relevant documents to q within D without revealing any information about q or the chosen document, even to the server itself? At a high level, Coeus composes two cryptographic primitives: secure matrix-vector product for scoring document relevance using the widely-used term frequency-inverse document frequency (tf-idf) method, and private information retrieval (PIR) for obliviously retrieving documents. However, Coeus reduces the time to run these protocols, thereby improving the user-perceived latency, which is a key performance metric. Coeus first reduces the PIR overhead by separating out private metadata retrieval from document retrieval, and it then scales secure matrix-vector product to tf-idf matrices with several hundred billion elements through a series of novel cryptographic refinements and an efficient workload distribution strategy. For a corpus of English Wikipedia containing 5 million documents, a keyword dictionary with 64K keywords, and on a cluster of 143 machines on AWS, Coeus enables a user to obliviously rank and retrieve a document in 3.9 seconds—a 24× improvement over a baseline system.
The third contribution of this dissertation, Pantheon, addresses the problem of preserving client privacy in a scenario where a cloud server manages a key-value store and offers a private query service to clients. Ensuring client privacy in this setting is difficult because the key-value store is public, and a client cannot encrypt or modify it. Therefore, privacy in this context implies hiding the accesses pattern of a client. Pantheon cryptographically allows a client to retrieve the value corresponding to a key from a public key-value store without allowing the server or any adversary to know any information about the key or value accessed. Pantheon devises a single-round retrieval protocol which reduces server-side latency by refining its cryptographic machinery and massively parallelizing the query execution workload. Using these novel techniques, Pantheon achieves a 93× improvement for server-side latency over a state-of-the-art solution.