SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

📅 2024-11-14
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision Transformers (ViTs) inherently lack multi-scale modeling capability, hindering effective capture of cross-scale semantics and spatial context in image classification. To address this, we propose Scale-Aware Graph Attention (SAGA), the first method to map multi-scale features—extracted by EfficientNetV2—onto a structured spatial-semantic weighted graph. SAGA dynamically constructs a sparse graph by jointly encoding spatial proximity and feature similarity, and introduces scale-aware attention at the node level to guide collaborative encoding between Graph Attention Networks (GATs) and Transformers. This enables high-fidelity, multi-granularity contextual modeling. On multiple image classification benchmarks, SAGA consistently outperforms standard ViTs and existing graph-augmented ViT variants, achieving absolute Top-1 accuracy gains of 1.8–3.2%. Moreover, it demonstrates superior few-shot generalization and robustness under distribution shifts.

Technology Category

Application Category

📝 Abstract
Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which is inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates multi-scale feature capabilities of CNNs, representational power of ViTs, graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps, dividing them into patches to preserve richer semantic information compared to directly patching the input images. The patches are structured into a graph using spatial and feature similarities, where a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating its effectiveness in advancing image classification tasks. Our code and weights are available at https://github.com/shravan-18/SAG-ViT.
Problem

Research questions and friction points this paper is trying to address.

Visual Transformers
Multi-scale Features
Image Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAG-ViT
Multi-scale Feature Processing
Graph Attention Mechanism
🔎 Similar Papers
No similar papers found.