Parallel Routing for FPGAs with Sparse Intra-Cluster Routing Crossbars
- Author(s): Ould Mohamed Moctar, Yehdhih
- Advisor(s): Brisk, Philip
- et al.
Routing is the most time-consuming step in synthesizing an electronic design on a Field Programmable Gate Array (FPGA). It involves the creation of a Routing Resource Graph (RRG), a large data structure representing the physical architecture of the FPGA. In this work, we first introduce two scalable routing heuristics for FPGAs with sparse intra-cluster routing crossbars: SElective RRG Expansion (SERRGE), which compresses the RRG and dynamically decompresses it during routing, and Partial Pre-Routing (PPR), which locally routes all nets in each cluster and routes global nets afterwards. Our experiments show that: (1) PPR and SERRGE converge faster than a traditional router using a fully-expanded RRG; (2) they both achieve better routability than the traditional router, given a limited runtime budget; and (3) PPR uses far less memory and runs much faster than SERRGE, making it ideal for high-capacity FPGAs.
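To make the idea of a compressed RRG that is decompressed during routing more concrete, the following is a minimal Python sketch of on-demand graph expansion. It is not the SERRGE implementation: the class `LazyRRG`, the `cluster_detail` table, and all node names are hypothetical, and a plain breadth-first search stands in for the router.

```python
from collections import deque


class LazyRRG:
    """Routing graph whose cluster-internal nodes are added only when visited.

    Hypothetical illustration: `edges` holds the compressed graph (global wires
    plus one node per cluster); `cluster_detail` maps a cluster node to the
    adjacency of its internal crossbar, which is merged in on first visit.
    """

    def __init__(self, edges, cluster_detail):
        self.edges = {node: list(succ) for node, succ in edges.items()}
        self.cluster_detail = cluster_detail
        self.expanded = set()

    def neighbors(self, node):
        # Decompress a cluster the first time the search reaches it.
        if node in self.cluster_detail and node not in self.expanded:
            for src, dsts in self.cluster_detail[node].items():
                self.edges.setdefault(src, []).extend(dsts)
            self.expanded.add(node)
        return self.edges.get(node, [])


def reachable(rrg, source, sink):
    """Breadth-first search that expands clusters only along explored paths."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in rrg.neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Two clusters exist, but only the one on the explored path is expanded.
edges = {"wire0": ["clb0"], "wire1": ["clb1"], "clb0": [], "clb1": []}
detail = {
    "clb0": {"clb0": ["clb0.xbar0"], "clb0.xbar0": ["clb0.lut_in2"]},
    "clb1": {"clb1": ["clb1.xbar0"], "clb1.xbar0": ["clb1.lut_in0"]},
}
rrg = LazyRRG(edges, detail)
found = reachable(rrg, "wire0", "clb0.lut_in2")
```

In this sketch only `clb0` is ever materialized, which is the memory-saving effect that a selectively expanded graph aims for; the compression scheme actually used in the work may differ.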
We then introduce new dynamic-multiplexing-based hybrid logic blocks that can be configured either to operate as regular configurable logic blocks (CLBs) or to implement the shifting operations required for mantissa alignment and normalization in floating-point operations. We show that: (1) the number of CLBs required for shifting operations is reduced by 67%, and if shifting is not required, these hybrid logic blocks can be configured for normal operation, so no functionality is sacrificed; (2) the area overhead incurred by these modifications is small; and (3) there is no negative impact on clock frequency or routability for benchmarks that do not use floating-point shifting.
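For readers unfamiliar with why floating-point arithmetic needs shifters, the sketch below shows the two shifting steps named above in software form. It is only an illustration under simplifying assumptions (unsigned integer mantissas, no rounding, no special values); the function names `align` and `normalize` and the `width` parameter are hypothetical and do not describe the hardware proposed in this work.

```python
def align(m_a, e_a, m_b, e_b):
    """Mantissa alignment: right-shift the mantissa with the smaller exponent
    so that both operands share the larger exponent (shifted-out bits are dropped)."""
    if e_a >= e_b:
        return m_a, m_b >> (e_a - e_b), e_a
    return m_a >> (e_b - e_a), m_b, e_b


def normalize(m, e, width=24):
    """Normalization: shift until the leading one sits at bit `width - 1`,
    adjusting the exponent to keep the represented value unchanged."""
    if m == 0:
        return 0, 0
    while m >> width:                      # result overflowed `width` bits
        m >>= 1
        e += 1
    while not (m >> (width - 1)) & 1:      # leading one is too low
        m <<= 1
        e -= 1
    return m, e


aligned = align(0b1000, 3, 0b1000, 1)      # second mantissa shifted right by 2
renormalized = normalize(0b11, 0, width=4)  # shifted left by 2, exponent lowered
```

Both steps are variable-distance shifts, which is why they are costly to build from general-purpose logic blocks.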
Finally, we investigate the parallelization of FPGA routing on both GPUs and multicore, shared-memory CPUs, using a speculation-based approach. The router is a parallel implementation of PathFinder, which is the basis for most commercial FPGA routers. Our results demonstrate scalability for large benchmarks and show that the amount of available parallelism depends primarily on the circuit size, not on the inter-dependence of signals. The multicore-based parallel implementation achieved an average speedup of approximately 6x, while the GPU implementation achieved 10-15x, relative to the single-threaded router implemented in the publicly available Versatile Place and Route (VPR) framework.
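As background for the PathFinder-based router mentioned above, the following is a minimal, sequential Python sketch of negotiated-congestion routing: each net is ripped up and rerouted by shortest path, and nodes that stay overused accumulate a history cost until the nets stop sharing them. It does not show the speculative parallel scheme of this work. The cost formula, the doubling schedule for `pres_fac`, the function names, and the toy graph are all assumptions chosen for illustration, and every sink is assumed reachable from its source.

```python
import heapq


def route_net(graph, cost, src, dst):
    """Dijkstra shortest path where entering node n costs cost(n)."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph[node]:
            nd = d + cost(nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def pathfinder(graph, nets, capacity=1, max_iters=20):
    """Reroute all nets each iteration until no node exceeds its capacity."""
    occ = {n: 0 for n in graph}       # current number of nets using each node
    hist = {n: 0.0 for n in graph}    # accumulated congestion history
    routes = {}
    pres_fac = 0.5                    # present-congestion penalty (assumed schedule)
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):   # rip up the old route
                occ[n] -= 1

            def cost(n):
                over = max(0, occ[n] + 1 - capacity)
                return (1.0 + hist[n]) * (1.0 + pres_fac * over)

            routes[name] = route_net(graph, cost, src, dst)
            for n in routes[name]:
                occ[n] += 1
        overused = [n for n in graph if occ[n] > capacity]
        if not overused:
            return routes
        for n in overused:
            hist[n] += occ[n] - capacity
        pres_fac *= 2
    return None  # did not converge within max_iters


# Toy graph: both nets initially prefer the shared node "m".
graph = {
    "a": ["m", "x"], "b": ["m", "y1"], "m": ["c", "d"],
    "x": ["c"], "y1": ["y2"], "y2": ["d"], "c": [], "d": [],
}
routes = pathfinder(graph, {"n1": ("a", "c"), "n2": ("b", "d")})
```

On this toy graph the first iteration routes both nets through `m`; after the history cost of `m` rises, the net with the cheaper detour moves away and the routing becomes legal. Because each net's reroute reads and updates shared occupancy state, a parallel version has to decide how to handle nets routed concurrently, which is the kind of dependence a speculation-based approach addresses.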