Single thread performance in the multi-core era
- Author(s): Porter, Leonard Emerson;
- et al.
The era of multi-core processors has begun. These multi-core processors represent a significant shift in processor design: the design focus has moved from reducing individual program (thread) latency to improving overall workload throughput. For over three decades, programs automatically ran faster on each new generation of processor because of improvements to processor performance. However, in this last decade, many of the techniques for improving processor performance reached their limits. As a result, individual core performance has become stagnant, causing diminished performance gains for single-threaded programs. This dissertation focuses on improving single-thread performance on parallel hardware. To that end, I first introduce modifications to a new form of parallel memory hardware, Transactional Memory, which can improve the viability of Speculative Multithreading - a technique for using idle cores to improve single-threaded execution time. These modifications to Transactional Memory improve Speculative Multithreading effectiveness by a factor of three. I further improve the performance of Speculative Multithreading by addressing a primary source of performance loss - the loss of thread state due to frequent thread migrations between cores. By predicting the cache working set at the point of migration, I improve overall program performance by nine percent. Recognizing the demand for transistors to be dedicated to shared or parallel resources (more cores, better interconnect, larger shared caches), I next propose a method of improving branch prediction accuracy for smaller branch predictors. I demonstrate that there are regions of program execution where long histories hurt prediction accuracy. I provide effective heuristics for predicting these regions - in some cases enabling comparable accuracies from predictors of half the size. I then address the problem of contention among coscheduled threads for shared multi-core resources. To reduce resource contention, I propose a new technique for thread scheduling on multi-core processors with shared last-level caches, which improves the overall throughput, energy efficiency, and fairness of the coschedule.