Graph analytics has become a popular method for understanding the relationships in data collected from diverse sources, such as social media, sensor feeds, and scientific instruments. It allows analysts to identify patterns and answer questions that were previously intractable, and combining multiple data feeds yields a more complete picture of a problem. However, because real-world graphs are sparse, graph applications tend to exhibit random memory access patterns and low locality. Current computing and memory systems, by contrast, are optimized for high-locality applications through deep cache hierarchies and support for large, sequential memory accesses. Throughout this work, we show the limitations, in both performance and scalability, of running graph workloads on general-purpose systems, and we present scalable solutions for processing graph applications. This dissertation presents the following case studies:
First, designing a graph processing system that scales to graphs orders of magnitude larger than what fits on a single accelerator requires a careful codesign of accelerator memory bandwidth and capacity, the interconnect bandwidth between accelerators, and the overall system architecture. We present a high-level bottleneck-analysis model for designing and evaluating scalable, balanced graph processing accelerators. We then study the scalability limitations of previous graph accelerators and design an accelerator that overcomes them, using the analytical model to capture system-level requirements such as the memory system.
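The spirit of such a bottleneck analysis can be sketched in a few lines: each shared resource caps the achievable edge throughput, and the binding resource is their minimum. The function below is a hypothetical back-of-the-envelope sketch, not the dissertation's actual model; all parameter names and numbers are illustrative assumptions.

```python
# Hypothetical bottleneck-analysis sketch: every resource bounds edge
# throughput independently, and the system is balanced when no single
# resource dominates. All parameters are illustrative, not measured.

def edge_throughput(mem_bw_GBps, net_bw_GBps, compute_GEPS,
                    bytes_per_edge_mem=8.0, bytes_per_edge_net=4.0):
    """Return (bottleneck_name, giga-edges/s) for one accelerator node."""
    bounds = {
        "memory":  mem_bw_GBps / bytes_per_edge_mem,   # GB/s / (B/edge) = GE/s
        "network": net_bw_GBps / bytes_per_edge_net,
        "compute": compute_GEPS,
    }
    name = min(bounds, key=bounds.get)
    return name, bounds[name]

# Example: HBM2-class memory, modest interconnect, ample compute.
print(edge_throughput(mem_bw_GBps=256, net_bw_GBps=50, compute_GEPS=100))
# → ('network', 12.5)
```

A model of this shape makes the codesign question concrete: doubling memory bandwidth alone does not help while the interconnect remains the binding resource.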
Second, we analyze the network requirements of graph processing as the system scales to more nodes and more memory. We demonstrate that the traffic between graph processing nodes follows a uniform random distribution, and that while its bandwidth requirements are modest, a low-diameter interconnect improves performance.
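The uniform random traffic claim can be illustrated with a toy simulation: if each partition sends updates to uniformly random remote partitions, the per-destination load is nearly flat, which is precisely the pattern low-diameter topologies serve well. The node and message counts below are illustrative assumptions.

```python
import random

# Toy sketch (illustrative sizes): uniformly random message destinations
# produce a near-flat per-node load, i.e. uniform random traffic.
random.seed(1)
num_nodes = 16
num_messages = 160_000

load = [0] * num_nodes
for _ in range(num_messages):
    load[random.randrange(num_nodes)] += 1

# Ratio of the busiest destination to the mean; close to 1.0 means flat.
imbalance = max(load) / (num_messages / num_nodes)
print(f"max/mean destination load: {imbalance:.3f}")
```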
The third part of this work proposes a memory system optimized for applications with random memory access patterns. Our goal is to reduce contention on the data path, since contention on shared resources is the main performance bottleneck. By utilizing 2.5D/3D integration and multi-wavelength (WDM) silicon photonic (SiPh) technologies, we design a memory system with low access latency and low latency variation, as well as 4$\times$ higher bandwidth than high-bandwidth memory (HBM2) at the same capacity.
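Why sparse-graph workloads produce the random access pattern targeted here can be seen in a minimal sketch of a compressed sparse row (CSR) traversal; the graph size, edge count, and layout below are illustrative assumptions, not taken from the dissertation.

```python
import random

# Illustrative sketch (all sizes hypothetical): a sparse graph in CSR
# form. Neighbor IDs are scattered across the vertex-ID space, so a
# gather over them reads memory at effectively random offsets.
random.seed(0)
num_vertices = 1000
edges = sorted({(random.randrange(num_vertices), random.randrange(num_vertices))
                for _ in range(4000)})

# Build CSR: neighbors of v live in col_idx[row_ptr[v]:row_ptr[v+1]].
row_ptr = [0] * (num_vertices + 1)
for src, _ in edges:
    row_ptr[src + 1] += 1
for v in range(num_vertices):
    row_ptr[v + 1] += row_ptr[v]
col_idx = [dst for _, dst in edges]

# A PageRank-style gather: the scan of col_idx is sequential, but the
# indirect reads of values[dst] jump around the address space.
values = [1.0] * num_vertices
total = sum(values[dst] for v in range(num_vertices)
            for dst in col_idx[row_ptr[v]:row_ptr[v + 1]])

# Average distance between consecutive indirect accesses; anything much
# larger than a cache line defeats prefetching and spatial locality.
strides = [abs(col_idx[i + 1] - col_idx[i]) for i in range(len(col_idx) - 1)]
avg_stride = sum(strides) / len(strides)
print(f"edges: {len(col_idx)}, average indirect-access stride: {avg_stride:.0f}")
```

The indirect reads in the gather are the accesses that a deep cache hierarchy serves poorly and that a low-latency, high-bandwidth memory system of the kind proposed here serves well.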