Today's software systems are often unreliable. In addition to causing losses in the billions, software defects have been responsible for a number of serious injuries and deaths in transportation accidents, medical treatments, and defense operations. The situation is worsening as concurrency and distributed computing become integral parts of many real-world software systems. The non-determinism of concurrent and distributed systems, together with the unreliability of the hardware environment in which they operate, can result in defects that are hard to find and understand.
In this thesis, we have developed tools and techniques that augment testing so that it can quickly find and reproduce important bugs in concurrent and distributed systems. Our techniques are based on two key ideas: (i) use program analysis to increase coverage by predicting bugs that could have occurred in "nearby" program executions, and (ii) provide programming abstractions that let testers, without any knowledge of the underlying testing process, easily express their insights to guide testing towards executions that are more likely to exhibit bugs or to help achieve their testing objectives. The tools that we have built have found many serious bugs in large real-world software systems (e.g., the Jigsaw web server, the JDK, JGroups, and the Hadoop File System).
In the first part of the thesis, we describe how to predict and confirm bugs in concurrent systems that did not show up during testing but could have shown up had the program under consideration executed under a different thread schedule. This improves the coverage of testing and helps find corner-case bugs that are unlikely to be discovered by traditional testing. We have built predictive testing tools that find different classes of serious bugs, such as deadlocks, hangs, and typestate errors, in concurrent systems.
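To make the idea of predicting a bug from a "nearby" execution concrete, the following is a minimal sketch of one well-known style of prediction: a lock-order check over a single observed trace. It is not the algorithm developed in this thesis. The class `LockOrderPredictor` and its methods are hypothetical names introduced only for illustration, and the check is deliberately simplified: it considers only cycles between two locks and ignores happens-before constraints that might rule a reported schedule out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: predicts two-lock deadlocks from one observed trace. */
final class LockOrderPredictor {
    /** Means: thread `thread` acquired `acquired` while holding `held`. */
    record Edge(String thread, String held, String acquired) {}

    private final Map<String, List<String>> heldByThread = new HashMap<>();
    private final Set<Edge> edges = new LinkedHashSet<>();

    /** Record a lock acquisition seen in the observed execution. */
    public void acquire(String thread, String lock) {
        List<String> held = heldByThread.computeIfAbsent(thread, t -> new ArrayList<>());
        for (String h : held) {
            if (!h.equals(lock)) {          // ignore re-entrant acquisitions
                edges.add(new Edge(thread, h, lock));
            }
        }
        held.add(lock);
    }

    /** Record a lock release seen in the observed execution. */
    public void release(String thread, String lock) {
        List<String> held = heldByThread.get(thread);
        if (held != null) {
            held.remove(lock);
        }
    }

    /** Report lock pairs that two different threads took in opposite orders. */
    public List<String> predictDeadlocks() {
        List<String> reports = new ArrayList<>();
        for (Edge a : edges) {
            for (Edge b : edges) {
                boolean opposite = a.held().equals(b.acquired())
                        && a.acquired().equals(b.held());
                // Ordering the thread names reports each pair of edges once
                // and excludes pairs that come from the same thread.
                if (opposite && a.thread().compareTo(b.thread()) < 0) {
                    reports.add(a.thread() + " and " + b.thread()
                            + " on " + a.held() + ", " + a.acquired());
                }
            }
        }
        return reports;
    }
}
```

If a trace in which thread `T1` takes lock `A` and then `B`, and thread `T2` later takes `B` and then `A`, is fed to this class, it reports a potential deadlock even though the observed run completed normally. Such a report is only a prediction; as described above, a separate confirmation step is still needed to show that some thread schedule actually reaches the deadlock.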
In the second part of the thesis, we investigate how to improve the efficiency of testing distributed cloud systems by letting testers guide testing towards the executions that interest them. For example, a tester might want to test those executions that are more likely to be erroneous or more likely to help her achieve her testing objectives. We have built tools and frameworks that enable testers to easily express their knowledge and intuition to guide testing, without requiring any knowledge of the underlying testing process. We investigate these programmable testing tools in the context of large-scale distributed systems.