Towards a Library for Deterministic Failure Testing of Distributed Systems
Distributed systems are widespread today, serving millions of customers and processing huge amounts of data. These systems run on commodity hardware in environments with many uncertainties, e.g., partial network failures and race conditions between nodes. Testing distributed systems requires new test libraries that take these uncertainties into account and can reproduce scenarios with specific timing constraints in a programming-language-agnostic way. To this end, we present Failify, a cross-platform, programming-language-agnostic, and deterministic failure testing library for distributed systems, which can be seamlessly integrated into different build systems. As an infrastructure, Failify can also facilitate research on testing distributed systems in various ways. We experimented with six open-source distributed systems to show the compactness of Failify's deployment API. Our results indicate that, on average, the most reliable deployment architecture for these systems can be defined in fewer than 17 lines of code. We also experimented with HDFS to demonstrate potential scenarios where Failify's deterministic environmental manipulation and failure injection API can be effective.