AI Summary
This work addresses a fundamental computational incompatibility between matrix operations and graph traversal in graph dynamic programming, which hinders efficient execution on homogeneous in-memory architectures. To overcome this challenge, the authors propose GEN-Graph, a heterogeneous in-memory computing chip that, for the first time, enables scalable and exact solutions for general-purpose graph dynamic programming. The chip leverages 2.5D packaging to integrate processing-using-memory (PUM) units optimized for matrix computations and processing-near-memory (PNM) units tailored for graph traversal, guided by an algorithm-structure-aware hardware-software co-design. Experimental results demonstrate that the matrix unit achieves a 42.8× speedup and 392× higher energy efficiency over an H100 GPU on the all-pairs shortest paths (APSP) task, while the traversal unit delivers throughput of 2.56 million and 39,300 reads per second for short and long reads, respectively, which is up to 2.56× higher than existing accelerators.
Abstract
While graph-based dynamic programming (DP) is a cornerstone of genomics and network analytics, its efficiency is hampered by fundamentally conflicting computational patterns. Matrix-centric DP drives regular, compute-bound network analytics, while topology-centric DP handles irregular, memory-bound genomic traversals. These two categories of DP have substantially different computation patterns and dataflows, which makes it difficult for a single homogeneous processing-in-memory (PIM) architecture to efficiently support both.
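As a concrete instance of the matrix-centric pattern, the following minimal sketch runs the textbook Floyd-Warshall recurrence for all-pairs shortest paths on a dense weight matrix. It is an illustration only and not the authors' implementation; the function name `floyd_warshall` and the 4-node example graph are assumptions made here. The point is that every step touches the matrix in a regular, predictable order, which is what makes this class of DP a natural fit for matrix-oriented hardware.

```python
import math

INF = math.inf  # marks "no edge" in the weight matrix


def floyd_warshall(weights):
    """Return the all-pairs shortest-path matrix for a dense weight matrix.

    Assumes a zero diagonal and no negative cycles.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is left untouched
    for k in range(n):          # allow node k as an intermediate hop
        row_k = dist[k]
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == INF:     # i cannot reach k, so nothing can improve
                continue
            row_i = dist[i]
            for j in range(n):  # regular, streaming sweep over row i
                cand = d_ik + row_k[j]
                if cand < row_i[j]:
                    row_i[j] = cand
    return dist


# Hypothetical 4-node directed graph used only for illustration.
example = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
apsp = floyd_warshall(example)
for row in apsp:
    print(row)
```

The triple loop performs the same min-plus update on every matrix entry regardless of the graph's topology, in contrast to traversal-centric DP, where the access order depends on the edges of the graph.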
This work presents GEN-Graph, a novel heterogeneous PIM chiplet that integrates two types of specialized compute tiles within a 2.5D package: the matrix tile, a processing-using-memory (PUM) tile optimized for matrix-centric workloads such as all-pairs shortest paths (APSP); and the traversal tile, a processing-near-memory (PNM) tile optimized for traversal-centric DP workloads such as DNA sequence alignment. Our hardware-software co-design employs recursive partitioning and reconfigurable windowed bit-parallel logic to ensure exact computation. Results show that the matrix tile achieves a 42.8× speedup and 392× higher energy efficiency over the NVIDIA H100 GPU for APSP. For sequence-to-graph alignment, the traversal tile sustains 2.56 million reads/s for short reads and 39.3 thousand reads/s for long reads, outperforming state-of-the-art accelerators by up to 2.56× in throughput. GEN-Graph provides the first scalable, exact solution for general DP dataflows by matching hardware specialization to algorithmic structure.
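To make the traversal-centric side concrete, here is a minimal sketch of exact sequence-to-graph alignment as an edit-distance DP over a directed acyclic graph. It is a plain scalar formulation written for illustration, not the windowed bit-parallel logic the abstract describes. The function name `align_read_to_graph`, the one-character-per-node graph encoding, and the example graph are all assumptions introduced here.

```python
def align_read_to_graph(read, labels, preds):
    """Minimum edit distance between `read` and any path in a DAG.

    labels[v] is the single character on node v and preds[v] lists the
    predecessors of v. Nodes are assumed to be numbered in topological
    order. The alignment may start and end at any node.
    """
    m = len(read)
    source = list(range(m + 1))  # virtual start: i read characters inserted
    cols = []                    # cols[v][i]: best cost of read[:i] ending at v
    best = m                     # worst case: the whole read is inserted
    for v, ch in enumerate(labels):
        # Which columns feed node v depends on the graph's edges.
        prev_cols = [cols[p] for p in preds[v]] or [source]
        col = [0] * (m + 1)      # col[0] = 0 because any node may start a path
        for i in range(1, m + 1):
            sub = 0 if read[i - 1] == ch else 1
            diag = min(pc[i - 1] for pc in prev_cols) + sub  # match or mismatch
            dele = min(pc[i] for pc in prev_cols) + 1        # skip graph char
            ins = col[i - 1] + 1                             # extra read char
            col[i] = min(diag, dele, ins)
        cols.append(col)
        best = min(best, col[m])
    return best


# Hypothetical variation graph with one bubble: A -> (C | G) -> T.
labels = "ACGT"
preds = [[], [0], [0], [1, 2]]

for read in ("AGT", "GT", "AAT", "ACGT"):
    print(read, align_read_to_graph(read, labels, preds))
```

Unlike the matrix-centric case, the set of earlier columns each node reads is dictated by the graph's edges, so the memory access pattern is irregular and data dependent.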