PIM-FW: Hardware-Software Co-Design of All-pairs Shortest Paths in DRAM

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scalability bottleneck of the Floyd–Warshall (FW) algorithm for All-Pairs Shortest Paths (APSP) on CPU/GPU architectures—caused by its O(n³) time complexity and excessive DRAM data movement—this paper proposes a hardware–software co-designed acceleration framework targeting HBM3 memory. We introduce a bank-level, fine-grained bit-serial processing element (PE) array coupled with an interleaved dataflow, enabling complete distance updates of blocked FW computation within individual HBM3 memory banks and eliminating off-chip DRAM accesses. By synergistically integrating processing-in-memory (PIM) and near-memory computing (NMC), we establish a hybrid compute model. Evaluated on an 8192×8192 graph, our design achieves 18.7× end-to-end speedup over state-of-the-art GPU implementations and reduces DRAM energy consumption by 3200×, significantly surpassing conventional architectures in both performance and energy efficiency for APSP.

📝 Abstract
All-pairs shortest paths (APSP) is a fundamental problem in routing, logistics, and network analysis, but the cubic time complexity and heavy data movement of the canonical Floyd-Warshall (FW) algorithm severely limit its scalability on conventional CPUs and GPUs. In this paper, we propose PIM-FW, a novel co-designed hardware architecture and dataflow that leverages processing in and near memory to accelerate the blocked FW algorithm on an HBM3 stack. To enable fine-grained parallelism, we propose a massively parallel array of specialized bit-serial bank PEs and channel PEs that accelerate the core min-plus operations. A novel dataflow complements this hardware, employing an interleaved mapping policy for superior load balancing and a hybrid in- and near-memory computing model for efficient computation and reduction. The in-bank computing approach allows all distance updates to be performed and stored within the memory banks, a key contribution that eliminates the data-movement bottleneck inherent in GPU-based approaches. We implement a full software-hardware co-design and evaluate it with a cycle-accurate simulator modeling an 8-channel, 4-Hi HBM3 PIM stack on real road-network traces. Experimental results show that, for an 8192 x 8192 graph, PIM-FW achieves an 18.7x end-to-end speedup and consumes 3200x less DRAM energy compared to a state-of-the-art GPU-only Floyd-Warshall implementation.
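The blocked FW scheme and min-plus kernel the abstract refers to can be sketched as follows. This is an illustrative NumPy sketch of the standard three-phase tiled algorithm, not the paper's in-bank implementation; the tile size and function names are assumptions.

```python
import numpy as np

def min_plus(A, B):
    """Tropical 'matrix product': C[i, j] = min_k (A[i, k] + B[k, j])."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def floyd_warshall(D):
    """Classic in-place FW; used to finalize the diagonal pivot tile."""
    for k in range(D.shape[0]):
        D[:] = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def blocked_floyd_warshall(D, tile):
    """Three-phase blocked FW on a dense distance matrix D (np.inf = no edge)."""
    n = D.shape[0]
    assert n % tile == 0
    T = n // tile
    for k in range(T):
        ks = slice(k * tile, (k + 1) * tile)
        # Phase 1: the pivot tile depends only on itself.
        floyd_warshall(D[ks, ks])
        # Phase 2: tiles in the pivot row and pivot column.
        for j in range(T):
            if j == k:
                continue
            js = slice(j * tile, (j + 1) * tile)
            D[ks, js] = np.minimum(D[ks, js], min_plus(D[ks, ks], D[ks, js]))
            D[js, ks] = np.minimum(D[js, ks], min_plus(D[js, ks], D[ks, ks]))
        # Phase 3: all remaining tiles, relaxed through the pivot row/column.
        for i in range(T):
            for j in range(T):
                if i == k or j == k:
                    continue
                si = slice(i * tile, (i + 1) * tile)
                sj = slice(j * tile, (j + 1) * tile)
                D[si, sj] = np.minimum(D[si, sj], min_plus(D[si, ks], D[ks, sj]))
    return D
```

Phases 2 and 3 are independent min-plus "GEMM-like" updates once the pivot tile is finalized, which is what makes the per-tile work easy to distribute across parallel compute units.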
Problem

Research questions and friction points this paper is trying to address.

Accelerates the all-pairs shortest paths (APSP) algorithm using processing-in-memory
Reduces the data-movement bottleneck of Floyd-Warshall via in-bank computing
Improves scalability and energy efficiency for large-graph computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-software co-design for in-memory APSP acceleration
Bit-serial bank and channel PEs for fine-grained parallelism
Interleaved mapping and hybrid computing to eliminate data movement
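As a rough illustration of the bit-serial flavor of the PEs listed above, a minimum can be selected by streaming operand bits one per step, MSB first. This is a hypothetical behavioral sketch (the operand width and function name are assumptions), not the paper's PE microarchitecture:

```python
def bit_serial_min(a, b, width=16):
    """Pick min(a, b) by examining one bit per step, MSB first, the way a
    bit-serial comparator might; 'width' is an assumed operand width."""
    a_is_min = True   # default if the operands are equal
    decided = False   # latches once the first differing bit is seen
    for i in range(width - 1, -1, -1):
        abit, bbit = (a >> i) & 1, (b >> i) & 1
        if not decided and abit != bbit:
            a_is_min = abit < bbit
            decided = True
    return a if a_is_min else b
```

Processing one bit per cycle keeps each PE tiny, which is what allows a massively parallel array of them to fit within the area and power budget of a DRAM bank.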