🤖 AI Summary
ViG models suffer from static, architecture-dependent node-neighbor feature aggregation, limiting their capacity to model complex neighborhood relationships in image recognition. To address this, we propose a general, structure-agnostic dynamic aggregation paradigm based on cross-attention: the central node generates queries, while neighboring nodes jointly produce keys and values, enabling non-local, adaptive message passing. Built upon this principle, we design AttentionViG, achieving state-of-the-art top-1 accuracy (83.9%) on ImageNet-1K without task-specific tuning. The model further transfers effectively to MS COCO (object detection and instance segmentation) and ADE20K (semantic segmentation), consistently outperforming existing ViG variants while maintaining computational overhead comparable to mainstream vision models. Our core contribution is the first integration of cross-attention into the aggregation layer of visual graph neural networks, establishing a unified, efficient, and scalable neighborhood modeling mechanism.
📝 Abstract
Vision Graph Neural Networks (ViGs) have demonstrated competitive performance in image recognition tasks compared to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential component of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is still needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the central node, while the key and value projections come from its neighbors. Additionally, we introduce a novel architecture, AttentionViG, that uses the proposed cross-attention aggregation scheme to perform non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved state-of-the-art accuracy. We also assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance but also maintains efficiency, delivering competitive accuracy at FLOPs comparable to prior vision GNN architectures.
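The aggregation step described above can be sketched for a single node as scaled dot-product cross-attention: the central node is projected to a query, its graph neighbors to keys and values, and the softmax-weighted values form the aggregated message. This is a minimal NumPy illustration of that idea, not the paper's implementation; the function name, single-head formulation, and projection shapes are assumptions for clarity (the actual AttentionViG layer operates on batched patch graphs).

```python
import numpy as np

def cross_attention_aggregate(x_center, x_neighbors, W_q, W_k, W_v):
    """Aggregate neighbor features via cross-attention (illustrative sketch).

    x_center:    (d,)   feature of the central node
    x_neighbors: (k, d) features of its k graph neighbors
    W_q, W_k, W_v: (d, d_h) learned projection matrices (random here)
    Returns the (d_h,) aggregated message for the central node.
    """
    q = x_center @ W_q                       # query from the central node
    K = x_neighbors @ W_k                    # keys from the neighbors
    V = x_neighbors @ W_v                    # values from the neighbors
    scores = K @ q / np.sqrt(K.shape[-1])    # (k,) scaled dot-product scores
    w = np.exp(scores - scores.max())        # numerically stable softmax
    w /= w.sum()                             # attention weights over neighbors
    return w @ V                             # weighted sum of neighbor values

# Toy usage with random features and projections.
rng = np.random.default_rng(0)
d, k, d_h = 8, 4, 8
x = rng.normal(size=d)
neighbors = rng.normal(size=(k, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_h)) for _ in range(3))
message = cross_attention_aggregate(x, neighbors, W_q, W_k, W_v)
```

Because the query depends on the central node while keys and values depend on the neighbors, each node adaptively re-weights its neighborhood per input, in contrast to fixed aggregators such as Max-Relative or GraphSAGE's mean pooling.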