🤖 AI Summary
Modern ViT backbones (e.g., SAM, DINOv3) employ windowed attention and relative positional encoding, posing compatibility challenges for token compression due to their inherent spatial structure. To address this, we propose a spatially structure-preserving token merging method. Our approach features: (i) a 2D structured token layout and reduction strategy; (ii) a spatially aware merging algorithm that explicitly preserves local relative positional relationships; and (iii) a per-dimension maximum-magnitude feature retention mechanism to ensure information integrity. The method is plug-and-play and requires only brief fine-tuning. On SAM-H, it achieves 1.25× inference speedup with only a 0.7% mIoU drop; on DeiT-B, a single fine-tuning epoch yields 1.15× acceleration without accuracy loss. The method generalizes effectively across vision tasks, including semantic segmentation and image classification.
📝 Abstract
Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout, while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatially aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7% mIoU drop evaluated off-the-shelf on COCO, and a 1.15× speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
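To make the two core ideas concrete, here is a minimal NumPy sketch of (a) the max-magnitude-per-dimension token representation and (b) a structure-preserving 2D reduction. The function names and the specific 2×1 row-pair merge schedule are illustrative assumptions, not the paper's exact algorithm; the point is that merging spatial neighbors on a regular grid keeps the output a valid (smaller) grid, which is what window attention and relative positional encodings require.

```python
import numpy as np

def max_magnitude_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge k tokens of shape (k, d) into one (d,) token by keeping,
    for each feature dimension, the value with the largest absolute
    magnitude (sign included), preserving salient activations."""
    winners = np.abs(tokens).argmax(axis=0)          # (d,) index per dim
    return tokens[winners, np.arange(tokens.shape[1])]

def merge_rows_2x1(grid: np.ndarray) -> np.ndarray:
    """Toy structured 2D reduction (illustrative schedule): merge each
    vertically adjacent token pair, shrinking an (H, W, d) token grid
    to (H//2, W, d). The result is still a regular 2D grid, so spatial
    operations (windowing, relative positions) remain well-defined."""
    H, W, d = grid.shape
    assert H % 2 == 0, "toy schedule assumes an even number of rows"
    pairs = grid.reshape(H // 2, 2, W, d)            # group rows in pairs
    out = np.empty((H // 2, W, d), dtype=grid.dtype)
    for i in range(H // 2):
        for j in range(W):
            out[i, j] = max_magnitude_merge(pairs[i, :, j])
    return out
```

Note the contrast with mean-based merging (as in standard token merging methods): averaging attenuates large-magnitude features, while the per-dimension max-magnitude rule keeps them intact at the cost of mixing coordinates from different source tokens.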