Performance problems are common in many kinds of real-world applications, including smartphone apps, databases, web servers, and large-scale data analytical systems. The pervasive use of managed languages such as Java and C#, which can inflate memory usage in unpredictable ways, further exacerbates these problems. A great deal of evidence from various application communities shows that seemingly insignificant performance problems can lead to severe scalability degradation and even financial losses.
Performance problems are notoriously difficult to find and fix in real-world systems. First, visible performance degradation is often the accumulated effect of a great number of invisible problems scattered throughout the program. These problems manifest only under particular workloads or when the input size is large enough. Hence, it is extremely difficult, if not impossible, for developers to catch performance problems during testing, before they reach production and are experienced by users. Fixing these problems is equally difficult. Developing a fix requires understanding the root cause, which can be both time-consuming and labor-intensive given the complexity of large software systems. Furthermore, for modern applications such as data analytical systems, application developers often write only a thin layer of user logic (e.g., Map/Reduce programs) and treat the underlying system as a black box; it is often impossible to develop a fix when the problem lies deep inside the system code.
There is a rich literature on techniques for finding and fixing performance problems in managed programs. However, existing work suffers from one or more of the following drawbacks: (1) the lack of a general way to describe different kinds of performance problems, (2) the lack of effective test oracles that can capture invisible performance problems under small workloads, (3) the lack of effective debugging support that helps developers find the root cause when a bug manifests, and (4) the lack of a systematic approach to tuning memory usage in data-intensive systems. As a result, performance bugs still escape into production runs, where they hurt user experience, degrade system throughput, and waste computational resources. For modern Big Data systems, most of which are written in managed languages, performance problems are increasingly critical, causing scalability degradation, unacceptable latency, and even execution failures that lead to widespread crashes and financial losses.
In this dissertation, we propose a set of dynamic techniques that help developers find and fix memory-related performance problems both in programs running on a single machine and in data-intensive systems deployed on large clusters. Specifically, this dissertation makes the following three major contributions. The first contribution is the design of an instrumentation specification language (ISL) in which one can easily describe the symptoms and counter-evidence of performance problems. The language supports two primary actions, amplify and deamplify, which help developers capture invisible problems by “blowing up” the effect of performance bugs while reducing false warnings. The second contribution is the development of a general performance testing framework named PerfBlower, which efficiently profiles program executions, continuously amplifies the problems specified in ISL, and reports reference-path-based diagnostic information, making it easy for the user to understand the root cause of a reported problem and develop a fix. The third contribution is the design and implementation of interruptible tasks, a new type of data-parallel task that can be interrupted upon memory pressure, with part or all of its consumed memory reclaimed, and resumed when the pressure goes away; this helps large-scale data analytical systems such as Hadoop and Hyracks survive memory pressure.
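To make the third contribution concrete, the sketch below illustrates in Java the general shape of an interruptible task: a data-parallel task that gives back its memory when the runtime signals pressure and rebuilds its state before resuming. This is a minimal sketch under assumed names; InterruptibleTask, onInterrupt, onResume, and the pressure-handling logic are illustrative and do not reproduce the actual interface developed in this dissertation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of an interruptible data-parallel task.
 * A runtime memory monitor (not shown) would call interrupt() when heap
 * usage crosses a pressure threshold; the task releases its in-memory
 * state, waits for the pressure to clear, and then resumes.
 */
abstract class InterruptibleTask<IN> {
    // Input records not yet processed; these could be spilled to disk
    // when the task is interrupted.
    protected final Deque<IN> pending = new ArrayDeque<>();
    private volatile boolean interrupted = false;

    /** Process a single input record. */
    protected abstract void process(IN record);

    /** Release or serialize in-memory state so the GC can reclaim it. */
    protected abstract void onInterrupt();

    /** Rebuild any state that was released before continuing. */
    protected abstract void onResume();

    /** Enqueue an input record for processing. */
    public void addInput(IN record) {
        pending.add(record);
    }

    /** Called by the runtime when memory pressure is detected. */
    public void interrupt() {
        interrupted = true;
    }

    /** Main loop: drains pending input, yielding memory under pressure. */
    public void run() {
        while (!pending.isEmpty()) {
            if (interrupted) {
                onInterrupt();            // give memory back to the runtime
                waitForPressureToClear();
                onResume();               // restore state and continue
                interrupted = false;
            }
            process(pending.poll());
        }
    }

    private void waitForPressureToClear() {
        // A real system would block on a shared memory monitor; here we
        // simply request a collection and back off briefly.
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In the actual systems targeted by this dissertation, interruption and resumption are coordinated by the runtime across many concurrently running tasks; the sketch only shows the per-task life cycle that such coordination relies on.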
To evaluate these techniques, we have performed an extensive set of experiments using real-world programs and datasets. Our experimental results demonstrate that the techniques proposed in this dissertation can effectively detect and fix memory-related performance problems in both single-machine programs and distributed data-parallel systems. These techniques are also easy to use: users can quickly describe performance problems in ISL and/or develop interruptible tasks to improve the performance of their application code without understanding the underlying systems. These techniques can readily be employed in practice to help real-world developers in various testing and tuning scenarios.