Today’s large-scale services generally exploit loosely coupled
architectures that restrict functionality requiring tight cooperation (e.g.,
leader election, synchronization, and reconfiguration) to a small subset of
nodes. In contrast, this work presents a way to scalably deploy tightly coupled
distributed systems that require significant coordination among a large number
of nodes in the wide area. Our design relies upon a new reliable group
membership abstraction to ensure either that group members can communicate
or that new groups form. In particular, we deploy a distributed
rate limiting (DRL) service within a global testbed infrastructure. Unlike most
distributed services, DRL can safely operate in separate partitions
simultaneously, but it requires timely snapshots of global state within each
partition. Our
DRL implementation leverages our proposed group membership abstraction and a
robust gossip-based communication protocol, conjoining the fates of view
maintenance and data dissemination. Through local and wide-area experiments, we
illustrate that DRL remains accurate and responsive in the face of a variety of
failure scenarios.
Pre-2018 CSE ID: CS2012-0973