Modular and Safe Event-Driven Programming
- Author(s): Desai, Ankush Pankaj
- Advisor(s): Seshia, Sanjit
- et al.
Asynchronous event-driven systems are ubiquitous across domains such as device drivers, distributed systems, and robotics. These systems are notoriously hard to get right as the programmer needs to reason about numerous control paths resulting from the complex interleaving of events (or messages) and failures. Unsurprisingly, it is easy to introduce subtle errors while attempting to fill in gaps between high-level system specifications and their concrete implementations.
This dissertation proposes new methods for programming safe event-driven asynchronous systems.
In the first part of the thesis, we present ModP, a modular programming framework for compositional programming and testing of event-driven asynchronous systems.
The ModP module system supports a novel theory of compositional refinement for assume-guarantee reasoning of dynamic event-driven asynchronous systems.
We build a complex distributed systems software stack using ModP.
Our results demonstrate that compositional reasoning can help scale model-checking (both explicit and symbolic) to large distributed systems.
ModP is transforming the way asynchronous software is built at Microsoft and Amazon Web Services (AWS).
Microsoft uses ModP for implementing safe device drivers and other software in the Windows kernel.
AWS uses ModP for compositional model checking of complex distributed systems.
While ModP simplifies analysis of such systems, the state space of industrial-scale systems remains extremely large.
In the second part of this thesis, we present scalable verification and systematic testing approaches to further mitigate this state-space explosion problem.
First, we introduce the concept of a delaying explorer to perform prioritized exploration of the behaviors of an asynchronous reactive program.
A delaying explorer stratifies the search space using a custom strategy (tailored towards finding bugs faster), and a delay operation that allows deviation from that strategy. We show that prioritized search with a delaying explorer performs significantly better than existing approaches for finding bugs in asynchronous programs.
Next, we consider the challenge of verifying time-synchronized systems; these are almost-synchronous systems as they are neither completely asynchronous nor synchronous.
We introduce approximate synchrony, a sound and tunable abstraction for verification of almost-synchronous systems.
We show how approximate synchrony can be used for verification of both time-synchronization protocols and applications running on top of them.
Moreover, we show how approximate synchrony also provides a useful strategy to guide state-space exploration during model-checking.
Using approximate synchrony and implementing it as a delaying explorer, we were able to verify the correctness of the IEEE 1588 distributed time-synchronization protocol and, in the process, uncovered a bug in the protocol that was well appreciated by the standards committee.
In the final part of this thesis, we consider the challenge of programming a special class of event-driven asynchronous systems -- safe autonomous robotics systems.
Our approach towards achieving assured autonomy for robotics systems consists of two parts: (1) a high-level programming language for implementing and validating the reactive robotics software stack; and (2) an integrated runtime assurance system to ensure that the assumptions used during design-time validation of the high-level software hold at runtime.
Combining high-level programming language and model-checking with runtime assurance helps us bridge the gap between design-time software validation that makes assumptions about the untrusted components (e.g., low-level controllers), and the physical world, and the actual execution of the software on a real robotic platform in the physical world. We implemented our approach as DRONA, a programming framework for building safe robotics systems.
We used DRONA for building a distributed mobile robotics system and deployed it on real drone platforms.
Our results demonstrate that DRONA (with the runtime-assurance capabilities) enables programmers to build an autonomous robotics software stack with formal safety guarantees.
To summarize, this thesis contributes new theory and tools to the areas of programming languages, verification, systematic testing, and runtime assurance for programming safe asynchronous event-driven across the domains of fault-tolerant distributed systems and safe autonomous robotics systems.