CPU speeds double approximately every eighteen months, while main
memory speeds double only about every ten years. These diverging rates imply an
impending ``Memory Wall'', in which memory accesses dominate program
performance. An effective way to reduce the observed latency of memory
accesses is data prefetching: predicting which data the processor will use in
the near future and bringing that data into storage locations closer to the
execution engine (e.g., caches or dedicated prefetch buffers). Meanwhile,
applications are constantly becoming more complex and demanding, a trend
reflected in the varied data structures and constructs that programming
languages provide. While caches are extremely successful at handling accesses
with good locality, the remaining accesses result in costly cache misses.
For these reasons, we performed a study that maps the miss stream to several
fundamental access types inherent to programming constructs. The results of
this study show that pointer-based applications with accesses to linked data
structures present the biggest challenge to the memory subsystem. Consequently,
they also have the biggest potential for speedup when data prefetching removes
their adverse effects on program performance. In this thesis, we therefore
focus on data prefetching techniques targeting pointer-based applications with
irregular access patterns. With a better understanding of the kinds of accesses
that cause cache misses, we have developed three prefetching techniques that
together handle these cases.
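To make the problem concrete, consider the kind of pointer-chasing loop that
defeats stride-based prefetchers. The sketch below is a hypothetical
illustration, not code from the studied benchmarks: each load address is
produced by the previous load, so the misses serialize.

\begin{verbatim}
#include <stddef.h>

/* A linked-list node; nodes are rarely contiguous in memory,
 * so successive next-pointers follow no fixed stride. */
struct node {
    int          key;
    struct node *next;
};

/* Each iteration's load address (n->next) comes from the previous
 * iteration's load, so a miss on any node stalls the entire chain:
 * the recurrent pointer access a stride prefetcher cannot predict. */
long sum_list(const struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        sum += n->key;  /* work depends on the pointer just loaded */
        n = n->next;    /* serialized, irregular address stream    */
    }
    return sum;
}
\end{verbatim}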
The first technique enhances traditional stream buffer designs by following an
address prediction stream instead of the fixed stride originally proposed,
improving the stream buffers' ability to target pointer-based programs. We also
explore utilizing a hardware pointer cache to predict the values of pointer
loads in order to break the long dependence chains caused by recurrent pointer
accesses. This allows the execution of dependent instructions to overlap with
the actual memory access, providing significant benefits.
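The pointer cache itself is a hardware structure, but its effect can be
sketched in software: a small table maps the address of a pointer field to the
value that field last held, and the predicted value is used to start dependent
work early while the real load verifies it. The table size, hash, and function
names below are hypothetical illustrations, not the hardware design.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

#define PC_ENTRIES 1024  /* illustrative size; a power of two */

/* Direct-mapped software analogue of the pointer cache: indexed by
 * the address of a pointer field, storing the value that field held
 * the last time it was loaded. */
static struct {
    uintptr_t field_addr;
    void     *last_value;
} pc[PC_ENTRIES];

static size_t pc_index(const void *field)
{
    /* Drop alignment bits, then mask into the table. */
    return ((uintptr_t)field >> 3) & (PC_ENTRIES - 1);
}

/* Predict the value a pointer load will return; NULL on a miss. */
void *pc_predict(const void *field)
{
    size_t i = pc_index(field);
    return pc[i].field_addr == (uintptr_t)field ? pc[i].last_value
                                                : NULL;
}

/* Record the value actually loaded, training the predictor. */
void pc_train(const void *field, void *value)
{
    size_t i = pc_index(field);
    pc[i].field_addr = (uintptr_t)field;
    pc[i].last_value = value;
}
\end{verbatim}

In a traversal, pc_predict(&n->next) lets work on the next node begin before
the real load of n->next completes; the prediction is later checked against
the loaded value, mirroring how the hardware verifies and recovers from a
mispredict.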
Another technique, Speculative Precomputation, attempts to prefetch data ahead
of its use by running a shortened version of the program on a spare
multithreaded hardware context.
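The shortened program, commonly called a slice, retains only the instructions
needed to generate addresses and discards the rest of the computation. As a
software analogue (a POSIX thread standing in for the spare hardware context,
with hypothetical function names and a GCC-style prefetch builtin), a slice
for the list traversal above might look like:

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

struct node { int key; struct node *next; }; /* as above */

/* The precomputation slice: only the pointer chasing survives, so it
 * runs ahead of the main thread and warms the cache with prefetches. */
static void *list_slice(void *arg)
{
    const struct node *n = arg;
    while (n != NULL) {
        __builtin_prefetch(n->next, 0, 3); /* non-binding hint */
        n = n->next;
    }
    return NULL;
}

long sum_list_with_slice(const struct node *head)
{
    pthread_t helper;
    long sum = 0;

    /* Launch the slice; a spare SMT context plays this role in
     * hardware. */
    pthread_create(&helper, NULL, list_slice, (void *)head);

    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->key;  /* the full computation */

    pthread_join(helper, NULL);
    return sum;
}
\end{verbatim}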
The final scheme we propose builds upon speculative precomputation and uses the
pointer cache to supply the values of pointer loads in the speculative threads.
Furthermore, we add control-flow instructions to the speculative threads to
guide them down the correct traversal path when multiple next transitions are
possible.
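For example, when several next transitions are possible, the slice must retain
the branch that selects among them. A hypothetical slice for a binary search
keeps the key comparison so that it prefetches down the same subtree the main
thread will visit:

\begin{verbatim}
struct tree { int key; struct tree *left, *right; };

/* The comparison is kept as control flow, steering the slice down
 * the left or right child rather than fanning out blindly. */
static void search_slice(const struct tree *t, int target)
{
    while (t != NULL && t->key != target) {
        const struct tree *next =
            (target < t->key) ? t->left : t->right;
        __builtin_prefetch(next, 0, 3);
        t = next;
    }
}
\end{verbatim}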
Our results show considerable benefits when these techniques are used together,
complementing one another.