Scalability is a fundamental problem in computer science. Computer scientists often describe the scalability of algorithms in the language of theoretical computational complexity, bounding the number of operations an algorithm performs as a function of the size of its input. The main contribution of this dissertation is to provide an analogous description of the scalability of actual software implementations run on realistic workloads.
We propose a method for describing the asymptotic behavior of programs in practice by measuring their empirical computational complexity. Our method involves running a program on workloads spanning several orders of magnitude in size, measuring its performance on each, and fitting these observations to a model that predicts performance as a function of workload size. Comparing these models to the programmer's expectations or to theoretical asymptotic bounds can reveal performance bugs or confirm that a program's performance scales as expected.
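The fitting step can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the measurements are hypothetical, and the power-law fit is done by ordinary least squares in log-log space.

```python
import math

def fit_linear(xs, ys):
    # Least-squares fit of y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def fit_powerlaw(xs, ys):
    # Fit y = a * x**b by linear regression in log-log space.
    log_a, b = fit_linear([math.log(x) for x in xs],
                          [math.log(y) for y in ys])
    return math.exp(log_a), b

# Hypothetical measurements: workload size vs. operation count,
# spanning several orders of magnitude in size.
sizes  = [10, 100, 1000, 10000]
counts = [220, 21000, 2.1e6, 2.0e8]   # roughly quadratic growth

a, b = fit_powerlaw(sizes, counts)
print(f"model: y = {a:.2f} * x^{b:.2f}")  # exponent near 2 suggests O(n^2)
```

Comparing the fitted exponent against the expected asymptotic bound is what flags a mismatch: an exponent near 2 where the programmer expected linear behavior points at a scalability problem.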
We develop our methodology for constructing these models of empirical complexity as we describe and evaluate two techniques. Our first technique, BB-TrendProf, constructs models that predict how many times each basic block runs as a linear (y = a + b*x) or a power-law (y = a*x^b) function of user-specified features of the program's workloads. To present output succinctly and focus attention on scalability-critical code, BB-TrendProf groups and ranks program locations based on these models. We demonstrate the power of BB-TrendProf compared to existing tools by running it on several large programs and reporting cases where its models show (1) an implementation of a complex algorithm scaling as expected, (2) two complex algorithms beating their worst-case theoretical complexity bounds when run on realistic inputs, and (3) a performance bug.
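The ranking idea can be sketched as follows; the block names, fitted models, and tie-breaking rule here are hypothetical, chosen only to illustrate how fitted exponents focus attention on the fastest-growing locations.

```python
# Each program location maps to a fitted power-law model (a, b) for y = a*x^b.
# Locations whose execution counts grow fastest (largest exponent) rank first.
models = {
    "sort_inner_loop": (0.5, 2.01),
    "parse_token":     (3.0, 1.00),
    "init_tables":     (120.0, 0.02),
}

def rank_blocks(models):
    # Sort by exponent, breaking ties with the coefficient.
    return sorted(models, key=lambda k: (models[k][1], models[k][0]),
                  reverse=True)

print(rank_blocks(models))  # → ['sort_inner_loop', 'parse_token', 'init_tables']
```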
Our second technique, CF-TrendProf, models the performance of loops and functions both per-function-invocation and per-workload. It improves upon the precision of BB-TrendProf's models by using control flow to generate candidates from a richer family of models and a novel model selection criterion to select among these candidates. We show that CF-TrendProf's improvements to model generation and selection allow it to correctly characterize or closely approximate the empirical scalability of several well-known algorithms and data structures and to diagnose several synthetic, but realistic, scalability problems without observing an egregiously expensive workload. We also show that CF-TrendProf deals with multiple workload features better than BB-TrendProf. We qualitatively compare the output of BB-TrendProf and CF-TrendProf and discuss their relative strengths and weaknesses.
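Selecting among a richer family of candidate models can be sketched as below. The candidate shapes, the single-coefficient fit, and the selection rule (lowest mean relative error) are illustrative assumptions, not the dissertation's actual criterion; the timings are hypothetical.

```python
import math

# Candidate model shapes, fit as y = c * f(x) for a single coefficient c.
CANDIDATES = {
    "constant":  lambda x: 1.0,
    "linear":    lambda x: x,
    "n log n":   lambda x: x * math.log(x),
    "quadratic": lambda x: x * x,
}

def fit_scale(f, xs, ys):
    # Least-squares coefficient c for y = c * f(x).
    return sum(f(x) * y for x, y in zip(xs, ys)) / \
           sum(f(x) ** 2 for x in xs)

def select_model(xs, ys):
    # Pick the candidate with the lowest mean relative error.
    best = None
    for name, f in CANDIDATES.items():
        c = fit_scale(f, xs, ys)
        err = sum(abs(c * f(x) - y) / y for x, y in zip(xs, ys)) / len(xs)
        if best is None or err < best[2]:
            best = (name, c, err)
    return best

sizes = [10, 100, 1000, 10000]
times = [0.0031, 0.031, 0.29, 3.1]   # hypothetical per-workload timings

name, c, err = select_model(sizes, times)
print(name)  # → linear
```

Using relative rather than absolute error keeps the small workloads from being drowned out by the large ones, which matters when workload sizes span several orders of magnitude.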