ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational redundancy that full-resolution tokens incur in conventional vision transformers for dense feature extraction. To this end, the paper proposes a coarse-to-fine, mixed-resolution vision transformer that begins with low-resolution tokens and dynamically identifies semantic boundary regions using a lightweight boundary-scoring allocator. High-resolution tokens are introduced only in these critical regions, guided by a mixed-resolution attention mechanism and an iterative token allocation strategy. This concentrates computation where it matters while preserving sensitivity to weak boundary evidence and improving the semantic consistency of token representations. The method achieves state-of-the-art performance on the ADE20K and COCO-Stuff benchmarks: the ARTA-Base variant (≈100M parameters) attains 54.6 mIoU on ADE20K with lower compute and memory cost than comparable backbones.
📝 Abstract
We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
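The iterative allocation described in the abstract can be sketched as a quadtree-style refinement loop: start from one coarse token per grid cell, and split any cell whose predicted boundary score exceeds a low threshold into finer tokens. This is a minimal illustrative sketch, not the paper's implementation; `score_fn` is a toy stand-in for the learned lightweight allocator.

```python
def allocate_tokens(score_fn, grid_size=8, threshold=0.3, max_iters=2):
    """Coarse-to-fine token allocation sketch (quadtree-style refinement).

    Each token is (row, col, size) in coarse-cell units. Cells whose boundary
    score exceeds `threshold` are split into four finer tokens; homogeneous
    regions keep a single coarse token. `score_fn(r, c, s)` stands in for
    the paper's learned boundary-scoring allocator.
    """
    tokens = [(r, c, 1.0) for r in range(grid_size) for c in range(grid_size)]
    for _ in range(max_iters):
        refined = []
        for (r, c, s) in tokens:
            if score_fn(r, c, s) > threshold and s > 0.25:
                h = s / 2.0
                # Split into four higher-resolution tokens near the boundary.
                refined += [(r, c, h), (r + h, c, h),
                            (r, c + h, h), (r + h, c + h, h)]
            else:
                refined.append((r, c, s))
        tokens = refined
    return tokens

# Toy score: cells near a vertical "class boundary" at x = 4 score high.
boundary_score = lambda r, c, s: 1.0 / (1.0 + abs(c + s / 2 - 4.0))
tokens = allocate_tokens(boundary_score)
```

With this toy score, token density grows only around the simulated boundary: cells far from it remain single coarse tokens, mirroring the paper's claim of avoiding redundant processing in homogeneous regions.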
Problem

Research questions and friction points this paper is trying to address.

dense feature extraction
computational efficiency
semantic boundaries
token allocation
mixed-resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive token allocation
mixed-resolution attention
coarse-to-fine vision transformer
efficient dense feature extraction
semantic boundary-aware
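
The mixed-resolution attention listed above lets coarse and fine tokens interact within a single attention operation. Below is a minimal single-head sketch under stated assumptions: random projections stand in for learned query/key/value weights, and the per-resolution scale embedding is an illustrative choice, not necessarily the paper's exact design.

```python
import numpy as np

def mixed_resolution_attention(coarse, fine, d=16, seed=0):
    """Single-head attention over a mixed set of coarse and fine tokens (sketch).

    coarse: (n_c, dim) coarse-token features; fine: (n_f, dim) fine-token
    features. A fixed random embedding per resolution stands in for a learned
    scale embedding distinguishing the two token types.
    """
    rng = np.random.default_rng(seed)
    scale_emb = rng.normal(size=(2, coarse.shape[1]))  # one embedding per resolution
    x = np.concatenate([coarse + scale_emb[0], fine + scale_emb[1]], axis=0)
    # Random projections stand in for learned Wq, Wk, Wv.
    Wq, Wk, Wv = (rng.normal(size=(x.shape[1], d)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = q @ k.T / np.sqrt(d)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))  # stable softmax
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v  # every token attends across both resolutions

out = mixed_resolution_attention(np.zeros((4, 16)), np.ones((6, 16)))
```

The key property is that coarse and fine tokens are concatenated into one sequence, so attention weights flow freely across resolutions rather than being confined to per-resolution groups.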