A proliferation of frameworks have emerged to handle the challenges of making distributed computations reliable and scalable. These enable users to easily perform analysis of large datasets on commodity clusters. As users have demanded better response times for these computations, newer versions of these frameworks have focused on efficiently keeping computation in memory. A major challenge in deploying such frameworks is understanding application memory requirements. Just as the layers of abstraction of frameworks assist in writing efficient and robust applications, they hide the true memory requirements.
In this dissertation, I describe and evaluate SLAMR, a tool I have developed for providing users with memory recommendations for programs written for the Apache Spark analytics stack. These recommendations practically address the lack of visibility users have into memory requirements even as they are asked to assign resources to deploy their analytics programs. My tool records activity from an example execution, both from the framework and the garbage collector of its underlying language runtime. Given this instrumentation, it estimates a memory configuration that effectively keeps the entire computation in memory, such that allocating more memory would have minimal benefit on performance. Because in-memory analytics frameworks are built to take advantage of the memory available to them, simply observing actual memory usage is not an effective way to produce such estimates. A challenge, then, is to produce these estimates without performing effort similar to trying configurations around the ultimate recommendation.
SLAMR provides these recommendations without requiring many example executions or detailed analysis of the semantics of the program. What I instrument allows it to predict the effect of different memory configurations rather than simply reflecting the available memory. It collects information corresponding to the abstractions provided by frameworks, so it can distinguish which memory usage is useful and account for when alternate storage was used. To understand the requirements of the underlying language runtime in this analytics stack, it also collects statistics about the program execution that can replayed into a dramatically simplified model of a garbage collector. Both of these models are built around the goal of providing a conservative bound, allowing users to use the resulting memory recommendations in confidence rather than inflate them to avoid the risk of memory exhaustion from errors in the estimates. I show through evaluation that SLAMR provides effective, consistently safe recommendations for a variety of analytics programs and does so with minimal measurement overhead.