DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fixed-size patching in non-convolutional vision models (e.g., ViT, Vision Mamba) wastes tokens on redundant background regions and discards critical local detail. To address this, we propose the Dynamic Adaptive Region Tokenizer (DART), a novel dynamic patching mechanism that jointly learns region importance scores and applies piecewise differentiable quantile operations. DART generates variable-sized, semantically dense patches conditioned on image content, enabling adaptive token sparsification with only ~1M additional parameters. Evaluated on ImageNet-1K, DART improves DeiT’s top-1 accuracy by 2.1% while reducing FLOPs by 45%. Consistent gains in both accuracy and computational efficiency are also observed on Vim and VideoMamba. By aligning tokenization granularity with semantic structure, DART enhances both model efficiency and representation capability without architectural modifications.
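The quantile-based allocation can be illustrated with a minimal 1-D sketch: boundaries are placed at quantiles of the cumulative importance mass, so high-score regions receive more (smaller) patches. Names such as `adaptive_boundaries` are hypothetical, not the paper's API, and this hard `searchsorted` version is non-differentiable; DART's piecewise differentiable quantile operations let gradients flow through the boundary positions.

```python
import numpy as np

def adaptive_boundaries(scores, n_patches):
    """Place patch boundaries at quantiles of cumulative importance.

    1-D illustration only (hypothetical helper, not the paper's code):
    regions with higher scores accumulate mass faster, so the equally
    spaced quantile levels fall closer together there, yielding
    narrower patches over information-rich areas.
    """
    cdf = np.cumsum(scores) / np.sum(scores)          # normalized cumulative mass
    targets = np.linspace(0, 1, n_patches + 1)[1:-1]  # interior quantile levels
    bounds = np.searchsorted(cdf, targets)            # first index reaching each level
    return np.concatenate(([0], bounds, [len(scores)]))

# A flat background with one information-rich region in the middle:
scores = np.array([0.1] * 8 + [1.0] * 4 + [0.1] * 8)
print(adaptive_boundaries(scores, 4))
```

The printed boundaries cluster around the high-score middle segment, so that region is covered by several narrow patches while the background is spanned by a few wide ones.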

📝 Abstract
Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
Problem

Research questions and friction points this paper is trying to address.

Adaptive patch partitioning for sparse object distribution
Reducing background encoding and preserving local details
Efficient token allocation with minimal computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive image partitioning into varying patch sizes
Learnable region scores with differentiable quantile operations
Efficient token allocation reducing FLOPs by 45%