AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
ViG models suffer from static, architecture-dependent node-neighbor feature aggregation, limiting their capacity to model complex neighborhood relationships in image recognition. To address this, we propose a general, structure-agnostic dynamic aggregation paradigm based on cross-attention: the central node generates queries, while neighboring nodes jointly produce keys and values, enabling non-local, adaptive message passing. Built upon this principle, we design AttentionViG, achieving state-of-the-art top-1 accuracy (83.9%) on ImageNet-1K without task-specific tuning. The model further transfers effectively to MS COCO (object detection and instance segmentation) and ADE20K (semantic segmentation), consistently outperforming existing ViG variants while maintaining computational overhead comparable to mainstream vision models. Our core contribution is the first integration of cross-attention into the aggregation layer of visual graph neural networks, establishing a unified, efficient, and scalable neighborhood modeling mechanism.

📝 Abstract
Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks compared to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is still needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture, AttentionViG, that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. We also assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance but also maintains efficiency, delivering competitive accuracy at FLOPs comparable to prior vision GNN architectures.
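The aggregation scheme described above (queries from the central node, keys and values from its neighbors) can be sketched as follows. This is a minimal single-node, single-head illustration under stated assumptions: the function name and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical and do not reproduce the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_aggregate(center, neighbors, Wq, Wk, Wv):
    """Aggregate neighbor features for one node via cross-attention.

    center:     (d,)   feature of the central node (query source)
    neighbors:  (k, d) features of its k neighbors (key/value source)
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical placeholders)
    """
    q = center @ Wq                     # (d,)   query from the central node
    K = neighbors @ Wk                  # (k, d) keys from the neighbors
    V = neighbors @ Wv                  # (k, d) values from the neighbors
    scores = K @ q / np.sqrt(len(q))    # (k,)   scaled dot-product scores
    attn = softmax(scores)              # (k,)   attention weights over neighbors
    return attn @ V                     # (d,)   dynamically aggregated message
```

Because the attention weights are recomputed per node from the current features, the aggregation adapts to each neighborhood rather than applying a fixed rule such as Max-Relative or GraphSAGE pooling.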
Problem

Research questions and friction points this paper is trying to address.

Developing a dynamic neighbor aggregation scheme for Vision GNNs that captures complex node-neighbor relationships
Creating a versatile cross-attention aggregation method that requires no architecture-specific refinements
Improving image recognition and downstream task performance while maintaining efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention aggregation that dynamically weights node-neighbor relationships
AttentionViG architecture that enables non-local message passing
SOTA accuracy at computational cost comparable to prior vision GNNs
👥 Authors
Hakan Emre Gedik, The University of Texas at Austin
Andrew Martin, The University of Texas at Austin
Mustafa Munir, The University of Texas at Austin (Machine Learning, Computer Vision, Generative AI, Superconducting Electronics, Neurosymbolic AI)
Oguzhan Baser, The University of Texas at Austin
Radu Marculescu, The University of Texas at Austin (machine learning, edge AI, embedded systems, cyber-physical systems, social networks)
Sandeep P. Chinchali, The University of Texas at Austin
Alan C. Bovik, The University of Texas at Austin