🤖 AI Summary
Fixed-size patching in non-convolutional vision models (e.g., ViT, Vision Mamba) leads to redundant encoding of background regions and loss of critical local detail. To address this, we propose the Dynamic Adaptive Region Tokenizer (DART), a fully differentiable dynamic patching mechanism that jointly learns region importance scores and applies piecewise differentiable quantile operations. DART generates variable-sized, semantically dense patches conditioned on image content, enabling adaptive token allocation with only ~1M additional parameters. Evaluated on ImageNet-1K, DART improves DeiT's top-1 accuracy by 2.1% while reducing FLOPs by 45%. Consistent gains in both accuracy and computational efficiency are also observed on Vim and VideoMamba. By aligning tokenization granularity with semantic structure, DART enhances both model efficiency and representation capability without architectural modifications.
📝 Abstract
Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
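To give a flavor of the quantile-based allocation idea, here is a minimal NumPy sketch (not the paper's implementation, which is fully differentiable and operates in 2-D): given learned per-position importance scores, patch boundaries are placed at uniform quantiles of the cumulative importance mass, so high-score regions receive more, smaller patches. The function name and the 1-D toy setup are illustrative assumptions.

```python
import numpy as np

def quantile_boundaries(scores, n_patches):
    """Place patch boundaries so each patch covers equal importance mass.

    scores: 1-D array of positive per-position importance scores
            (a toy 1-D stand-in for the learned region scores).
    Returns n_patches + 1 fractional boundary positions in [0, len(scores)].
    """
    cdf = np.cumsum(scores, dtype=float)
    cdf /= cdf[-1]  # normalized cumulative importance in (0, 1]
    targets = np.linspace(0.0, 1.0, n_patches + 1)
    # Invert the CDF by linear interpolation: high-score regions climb the
    # CDF quickly, so many boundaries land there -> smaller patches.
    positions = np.interp(
        targets,
        np.concatenate(([0.0], cdf)),
        np.arange(len(scores) + 1, dtype=float),
    )
    return positions

# Toy example: an information-rich middle third in a 1-D "image".
scores = np.array([1, 1, 1, 8, 8, 8, 1, 1, 1], dtype=float)
bounds = quantile_boundaries(scores, 3)
widths = np.diff(bounds)  # the middle (high-score) patch comes out narrowest
```

In the actual method, hard boundary selection like this would block gradients; the paper's piecewise differentiable quantile operations are what make the allocation trainable end-to-end together with the region scores.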