🤖 AI Summary
Dynamic Image Graph Construction (DIGC) constitutes the primary latency bottleneck in Vision Graph Neural Networks (ViG), accounting for up to 95% of total inference time at high resolutions. Existing algorithmic optimizations typically compromise flexibility, accuracy, or generality. This paper introduces the first FPGA-based streaming deep-pipelined architecture tailored for DIGC: it employs on-chip tiling and localized computation to drastically reduce off-chip memory accesses; integrates streaming local merge-sort with heap-insertion-driven global k-way merging to jointly ensure accuracy, configurability, and scalability; and supports seamless adaptation across diverse ViG models and arbitrary input resolutions. Post-place-and-route, the design achieves a high operating frequency. It delivers speedups of up to 16.6× and 6.8× over optimized CPU and GPU implementations, respectively, establishing an efficient hardware paradigm for real-time ViG deployment.
📝 Abstract
Vision Graph Neural Networks (Vision GNNs, or ViGs) represent images as unstructured graphs, achieving state-of-the-art performance in computer vision tasks such as image classification, object detection, and instance segmentation. Dynamic Image Graph Construction (DIGC) builds image graphs by connecting patches (nodes) based on feature similarity, and is repeated in each ViG layer after the GNN-based patch (node) feature updates. However, DIGC accounts for over 50% of end-to-end ViG inference latency, rising to 95% at high image resolutions, making it the dominant computational bottleneck. While hardware acceleration holds promise, prior works primarily optimize graph construction algorithmically, often compromising DIGC flexibility, accuracy, or generality. To address these limitations, we propose a streaming, deeply pipelined FPGA accelerator for DIGC, featuring on-chip buffers that process input features in small, uniform blocks. Our design minimizes external memory traffic via localized computation and sorts in parallel directly on streaming input blocks, combining a local merge sort with a global k-way merge performed via heap insertion. This modular architecture scales seamlessly across image resolutions, ViG layer types, and model sizes and variants, and supports DIGC across diverse ViG-based vision backbones. The design achieves high clock frequencies post-place-and-route because its statically configured parallelism minimizes critical path delay, and it delivers up to 16.6× and 6.8× speedups over optimized CPU and GPU DIGC baselines, respectively.
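
To make the blocked, streaming top-k idea concrete, the following is a minimal software sketch in Python, not the accelerator's actual hardware design. It assumes squared Euclidean distance as the similarity measure and keeps each node as its own nearest neighbor; the function name `digc_blocked_knn` and its parameters are hypothetical. Features are consumed in fixed-size blocks, each block's candidates are sorted locally (NumPy's `argsort` stands in for the local merge sort), and the sorted candidates are merged into a bounded per-node heap that holds the best k seen so far.

```python
import heapq

import numpy as np


def digc_blocked_knn(features, k, block_size):
    """Blocked k-nearest-neighbor graph construction (software sketch).

    features:   (N, D) array of patch (node) features.
    k:          number of neighbors kept per node.
    block_size: number of candidate nodes processed per streamed block.
    Returns an (N, k) integer array of neighbor indices, nearest first.
    """
    n = features.shape[0]
    # One bounded heap per node. Entries are (-distance, -index), so the
    # heap root is always the worst candidate currently kept.
    heaps = [[] for _ in range(n)]

    for start in range(0, n, block_size):
        block = features[start:start + block_size]  # (B, D) streamed block
        diff = features[:, None, :] - block[None, :, :]
        dist = (diff ** 2).sum(axis=-1)  # (N, B) squared distances

        # Local sort inside the block; only the k best can ever survive.
        order = np.argsort(dist, axis=1, kind="stable")[:, :k]

        # Global merge: insert the sorted local candidates into each heap.
        for i in range(n):
            heap = heaps[i]
            for j in order[i]:
                cand = (-dist[i, j], -(start + int(j)))
                if len(heap) < k:
                    heapq.heappush(heap, cand)
                elif cand > heap[0]:
                    heapq.heappushpop(heap, cand)  # evict the current worst
                else:
                    break  # remaining local candidates are even worse

    # Largest entry first means smallest distance first.
    return np.array(
        [[-idx for _, idx in sorted(heap, reverse=True)] for heap in heaps]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.standard_normal((196, 16))  # e.g. a 14x14 patch grid
    edges = digc_blocked_knn(patches, k=9, block_size=32)
    print(edges.shape)
```

Because only one block of candidates is resident at a time, the working set per step is bounded by the block size and k rather than by the full N×N distance matrix, which is the property the on-chip buffering relies on.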