The multicore era has driven a move toward ubiquitous parallelization of software. In the process, core counts have scaled out, but memory subsystem resources have not kept pace. Memory subsystem contention within and between applications makes it challenging to extract performance scaling that matches the increase in the number of cores. This dissertation explores the diagnosis of memory subsystem contention, identifies the associated performance and energy efficiency opportunities, and suggests techniques and optimizations to both precisely measure and reduce that contention. The dissertation begins by exploring contention within a single large-scale, distributed scientific application and between multiple such applications, and then moves to the impact of memory subsystem contention on graphics processing units, accelerators that are seeing increasing use in both commercial data centers and scientific clusters. The findings of these studies demonstrate that memory subsystem contention is a serious impediment to achieving high performance and energy efficiency, but also that relatively simple techniques, such as controlling job placement and resource sharing, tuning parallelism, and applying algorithmic optimizations at the application level, provide significant opportunities to improve performance and energy efficiency.
The dissertation comprises four distinct works. (1) It begins by quantifying the performance and energy efficiency opportunities afforded by co-scheduling large-scale distributed scientific applications within a supercomputer. (2) From there, it studies the design of a prototype system that dynamically quantifies inter-application interference between co-located supercomputer jobs and uses those estimates to reform the accounting system so that it more fairly reflects end-user utility. (3) Next, it explores performance and energy scaling of analytic database workloads on graphics processors and finds that disabling whole compute units can reduce both query execution time and energy consumption by limiting the number of threads that contend for shared hardware resources at any instant. (4) The dissertation concludes by describing the Horton table, a hash table that accelerates in-memory, data-intensive computing by making more efficient use of hardware caches and memory bandwidth.