🤖 AI Summary
Modern ViT backbones (e.g., SAM, DINOv3) employ windowed attention and relative positional encoding, posing compatibility challenges for token compression due to their inherent spatial structure. To address this, we propose a spatially structure-preserving token merging method. Our approach features: (i) a 2D structured token layout and reduction strategy; (ii) a spatially aware merging algorithm that explicitly preserves local relative positional relationships; and (iii) a per-dimension maximum-magnitude feature retention mechanism to ensure information integrity. The method is plug-and-play and requires only brief fine-tuning. On SAM-H, it achieves 1.25× inference speedup with only a 0.7% mIoU drop; on DeiT-B, a single fine-tuning epoch yields 1.15× acceleration without accuracy loss. The method generalizes effectively across vision tasks, including semantic segmentation and image classification.
📝 Abstract
Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout, while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatially aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-art results on spatial and non-spatial architectures across various vision tasks. Specifically, we achieve a 1.25× speedup on SAM-H with only a 0.7% mIoU drop evaluated off-the-shelf on COCO, and a 1.15× speedup on DeiT-B with no top-1 accuracy drop on ImageNet within just one epoch of fine-tuning.
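To make the two core ideas concrete, here is a minimal NumPy sketch of (a) the max-magnitude-per-dimension token representation and (b) a structure-preserving 2D reduction. The function names and the specific 2×1 row-pair merge schedule are illustrative assumptions, not the paper's exact algorithm; the point is that merging spatial neighbors on a regular grid keeps the output a valid (smaller) grid, which is what window attention and relative positional encodings require.

```python
import numpy as np

def max_magnitude_merge(tokens: np.ndarray) -> np.ndarray:
    """Merge k tokens of shape (k, d) into one (d,) token by keeping,
    for each feature dimension, the value with the largest absolute
    magnitude (sign included), preserving salient activations."""
    winners = np.abs(tokens).argmax(axis=0)          # (d,) index per dim
    return tokens[winners, np.arange(tokens.shape[1])]

def merge_rows_2x1(grid: np.ndarray) -> np.ndarray:
    """Toy structured 2D reduction (illustrative schedule): merge each
    vertically adjacent token pair, shrinking an (H, W, d) token grid
    to (H//2, W, d). The result is still a regular 2D grid, so spatial
    operations (windowing, relative positions) remain well-defined."""
    H, W, d = grid.shape
    assert H % 2 == 0, "toy schedule assumes an even number of rows"
    pairs = grid.reshape(H // 2, 2, W, d)            # group rows in pairs
    out = np.empty((H // 2, W, d), dtype=grid.dtype)
    for i in range(H // 2):
        for j in range(W):
            out[i, j] = max_magnitude_merge(pairs[i, :, j])
    return out
```

Note the contrast with mean-based merging (as in standard token merging methods): averaging attenuates large-magnitude features, while the per-dimension max-magnitude rule keeps them intact at the cost of mixing coordinates from different source tokens.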