🤖 AI Summary
Vision Transformers (ViTs) excel at modeling global dependencies but struggle to efficiently represent fine-grained local structures. Existing multi-scale approaches are constrained by fixed patch sizes, leading to redundant computation and limited adaptability. To address this, we propose a dynamic coarse-to-fine ViT framework. First, we introduce a novel dynamic granularity evaluation mechanism—based on edge density, entropy, and frequency-domain features—that jointly optimizes patch and window sizes via learnable parameters α and β in an end-to-end manner. Second, the framework integrates two complementary modules: Coarse Granularity Evaluation for global context abstraction and Fine-grained Refinement for localized detail enhancement, synergistically combining multi-scale attention with dynamic computational allocation. Evaluated on image classification and object detection benchmarks, our method significantly improves fine-grained discrimination capability while achieving superior accuracy–FLOPs trade-offs compared to state-of-the-art ViTs.
📝 Abstract
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; and (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
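To make the coarse-granularity idea concrete, the following is a minimal, non-authoritative sketch of how a complexity score could be built from the three cues named in the abstract (edge density, entropy, and high-frequency energy) and mapped to a patch size. The function names, the fixed `alpha`/`beta` weights, the blending formula, and the candidate patch sizes are illustrative assumptions, not the paper's actual (learnable, end-to-end) formulation.

```python
import numpy as np

def edge_density(img, thresh=0.1):
    # Fraction of pixels whose gradient magnitude exceeds a relative threshold.
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    return float((mag > thresh * mag.max()).mean()) if mag.max() > 0 else 0.0

def entropy(img, bins=256):
    # Shannon entropy of the intensity histogram, normalized to [0, 1].
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(bins))

def high_freq_ratio(img, cutoff=0.25):
    # Share of spectral energy outside a low-frequency square around DC.
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = power.shape
    ch, cw = int(h * cutoff), int(w * cutoff)
    low = power[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    total = power.sum()
    return float(1.0 - low / total) if total > 0 else 0.0

def select_granularity(img, alpha=0.5, beta=0.5, patch_sizes=(16, 8, 4)):
    # Blend the three cues into one complexity score; a higher score picks a
    # finer patch size. alpha/beta here are fixed stand-ins for the paper's
    # learnable balancing parameters (an illustrative heuristic blend).
    score = (alpha * edge_density(img)
             + beta * entropy(img)
             + (1 - alpha) * high_freq_ratio(img))
    idx = min(int(score * len(patch_sizes)), len(patch_sizes) - 1)
    return patch_sizes[idx], score
```

Under this sketch, a flat image yields a near-zero score and a coarse 16-pixel patch, while a texture-rich image pushes the score up and selects a finer patch, which is the qualitative behavior the Coarse Granularity Evaluation stage is described as providing.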