Despite the success of parallel architectures and domain-specific accelerators in boosting the performance of emerging parallel workloads, contemporary computer systems still face the bottleneck of data movement between processors and main memory. Processing-in-memory (PIM) architectures, especially designs that integrate compute logic near DRAM memory banks, are a promising approach to addressing this bottleneck. However, such in-DRAM near-bank integration poses hardware and software design challenges in performance, area overhead, architectural complexity, and programmability.
To address these challenges, this dissertation develops efficient hardware and software solutions for in-DRAM near-bank computing. First, it investigates the memory bandwidth bottleneck of contemporary hardware platforms through in-depth workload characterization, which motivates in-DRAM near-bank processing. Second, it proposes multiple full-stack in-DRAM near-bank processing solutions targeting application scopes that range from application-specific to general-purpose computing; these solutions reveal a wide spectrum of trade-off points among hardware efficiency, architectural flexibility, and software complexity. Third, building on these solutions, it introduces an open-source simulation framework that supports architectural and software optimization studies of in-DRAM near-bank processing. Finally, it develops novel machine learning-based compiler optimizations for partitioning workloads on a chiplet hardware platform whose distributed compute-memory abstraction resembles that of in-DRAM near-bank architectures.