🤖 AI Summary
Vision Transformers (ViTs) excel at modeling global dependencies but struggle to efficiently represent fine-grained local structures. Existing multi-scale approaches are constrained by fixed patch sizes, leading to redundant computation and limited adaptability. To address this, we propose a dynamic coarse-to-fine ViT framework. First, we introduce a novel dynamic granularity evaluation mechanism—based on edge density, entropy, and frequency-domain features—that jointly optimizes patch and window sizes via learnable parameters α and β in an end-to-end manner. Second, the framework integrates two complementary modules: Coarse Granularity Evaluation for global context abstraction and Fine-grained Refinement for localized detail enhancement, synergistically combining multi-scale attention with dynamic computational allocation. Evaluated on image classification and object detection benchmarks, our method significantly improves fine-grained discrimination capability while achieving superior accuracy–FLOPs trade-offs compared to state-of-the-art ViTs.
📝 Abstract
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; and (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
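To make the coarse-granularity idea concrete, the following is a minimal, non-authoritative sketch of how a complexity score could be built from the three cues named in the abstract (edge density, entropy, and high-frequency energy) and mapped to a patch size. The function names, the fixed `alpha`/`beta` weights, the blending formula, and the candidate patch sizes are illustrative assumptions, not the paper's actual (learnable, end-to-end) formulation.

```python
import numpy as np

def edge_density(img, thresh=0.1):
    # Fraction of pixels whose gradient magnitude exceeds a relative threshold.
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    return float((mag > thresh * mag.max()).mean()) if mag.max() > 0 else 0.0

def entropy(img, bins=256):
    # Shannon entropy of the intensity histogram, normalized to [0, 1].
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(bins))

def high_freq_ratio(img, cutoff=0.25):
    # Share of spectral energy outside a low-frequency square around DC.
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = power.shape
    ch, cw = int(h * cutoff), int(w * cutoff)
    low = power[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    total = power.sum()
    return float(1.0 - low / total) if total > 0 else 0.0

def select_granularity(img, alpha=0.5, beta=0.5, patch_sizes=(16, 8, 4)):
    # Blend the three cues into one complexity score; a higher score picks a
    # finer patch size. alpha/beta here are fixed stand-ins for the paper's
    # learnable balancing parameters (an illustrative heuristic blend).
    score = (alpha * edge_density(img)
             + beta * entropy(img)
             + (1 - alpha) * high_freq_ratio(img))
    idx = min(int(score * len(patch_sizes)), len(patch_sizes) - 1)
    return patch_sizes[idx], score
```

Under this sketch, a flat image yields a near-zero score and a coarse 16-pixel patch, while a texture-rich image pushes the score up and selects a finer patch, which is the qualitative behavior the Coarse Granularity Evaluation stage is described as providing.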