Query-based debugging of distributed systems
- Author(s): Braud, Ryan Evans
- et al.
One of the most challenging aspects of debugging distributed systems is understanding system behavior in the period leading up to a bug. Since traditional debuggers such as gdb are not well suited to distributed system debugging, developers often resort to annotating their code with log statements and then writing one-off scripts that perform ad-hoc searches through the logged data. To improve this cumbersome process, we propose that the state of a distributed system execution should be programmatically and interactively available for postmortem analysis. We observe that the three defining properties of entries in a distributed system's log are "time," "node identifier," and "event type," and treat the log as a logical cube with these dimensions. By exploiting the structure of this state matrix, developers can use a high-level query language to efficiently extract information instead of manually inspecting log files or writing log processing scripts. In this dissertation, we describe the debugging process based on a query-oriented approach. We begin with an introduction of the state matrix abstraction and show how it can capture useful properties of distributed systems' executions. We then present NyQL, an object-oriented query language operating over the contents of the state matrix and describe one possible implementation as a translation to SQL queries executed over a relational database. Next, we present an implementation of a logging system that generates queryable logs in Mace, a source-to-source translator and library for building distributed systems. We present techniques for mitigating the logging overhead by giving NyQL queries to the Mace translator and show that in many cases queries can be resolved in a few seconds. We then demonstrate how using NyQL simplified debugging a handful of bugs in two different distributed systems.
Finally, we extend our logging techniques to systems without source-to-source translators by developing two general-purpose libraries, one in C++ and one in Java. We describe the differences between all three systems in terms of functionality and ease of use and then conclude with some future directions for distributed systems debugging.
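To make the state matrix idea concrete, the following is a minimal sketch of a log stored as a relational table keyed by the three dimensions named above (time, node identifier, event type) and queried with plain SQL. It does not reproduce NyQL syntax or the Mace logging format, neither of which is shown here; the table schema, the sample entries, and the helper names `build_log` and `events_before` are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical log entries: (time, node identifier, event type, payload).
# These values are invented for illustration only.
LOG_ENTRIES = [
    (1, "node-a", "send", "join request to node-b"),
    (2, "node-b", "recv", "join request from node-a"),
    (3, "node-b", "send", "join reply to node-a"),
    (4, "node-c", "timer", "heartbeat expired"),
    (5, "node-a", "recv", "join reply from node-b"),
    (6, "node-a", "error", "unexpected state after join"),
]


def build_log(entries):
    """Load log entries into an in-memory table with one column per dimension."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (time INTEGER, node TEXT, event_type TEXT, payload TEXT)"
    )
    conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", entries)
    return conn


def events_before(conn, node, until, window):
    """Return one node's events in the half-open interval [until - window, until)."""
    rows = conn.execute(
        "SELECT time, event_type, payload FROM log "
        "WHERE node = ? AND time >= ? AND time < ? ORDER BY time",
        (node, until - window, until),
    )
    return rows.fetchall()


conn = build_log(LOG_ENTRIES)

# Slice the cube along the node and time dimensions: what did node-a do
# in the period leading up to the error it logged at time 6?
history = events_before(conn, "node-a", until=6, window=5)
for row in history:
    print(row)
```

Fixing one dimension (the node) and restricting another (the time range) replaces the one-off search script with a single declarative query; a higher-level language translated to SQL could generate queries of this shape.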