UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of pretrained Vision Transformers (ViTs) to spurious tokens in dense prediction tasks, a problem that existing approaches only partly mitigate because they define such tokens too narrowly. The authors propose UniRefiner, a unified refinement framework that systematically defines and categorizes three distinct types of spurious tokens. UniRefiner introduces a contrastive register mechanism that preserves semantic alignment while isolating spurious signals, so that the ViT learns to identify and discard interfering tokens on its own. With only about 5,000 images and a few fine-tuning epochs, the method raises EVA-CLIP-8B to 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models such as DINOv2 (49.1%), and improves zero-shot segmentation accuracy by up to 22%.

📝 Abstract

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
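
To make the dual objective in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and every name in it is hypothetical. It makes three loud assumptions: spurious tokens are detected with the narrow high-norm rule only (the paper argues for a broader, three-type definition that is not reproduced here); "filtered regular tokens" are approximated by replacing each detected spurious teacher token with the mean of the regular ones; and a simple cosine alignment loss stands in for whatever contrastive loss the paper actually uses.

```python
import math
import statistics


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    na, nb = l2_norm(a), l2_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean_vec(vectors):
    # Element-wise mean of a non-empty list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def detect_high_norm(tokens, ratio=3.0):
    # ASSUMPTION: narrow "high-norm outlier" rule; `ratio` is an
    # illustrative threshold, not a value from the paper.
    norms = [l2_norm(t) for t in tokens]
    threshold = ratio * statistics.median(norms)
    return [n > threshold for n in norms]


def dual_objective(student_patches, student_registers,
                   teacher_tokens, spurious_mask):
    # Hypothetical stand-in for the two alignment terms:
    #   (i)  patch tokens    -> filtered regular teacher tokens
    #   (ii) register tokens -> detected spurious teacher tokens
    regular = [t for t, bad in zip(teacher_tokens, spurious_mask) if not bad]
    spurious = [t for t, bad in zip(teacher_tokens, spurious_mask) if bad]
    if not regular:
        raise ValueError("need at least one regular teacher token")

    # ASSUMPTION: "filtering" replaces spurious targets with the regular mean.
    fill = mean_vec(regular)
    targets = [fill if bad else t
               for t, bad in zip(teacher_tokens, spurious_mask)]
    semantic = sum(1.0 - cosine(s, t)
                   for s, t in zip(student_patches, targets)) / len(targets)

    if not spurious or not student_registers:
        return semantic, 0.0

    # ASSUMPTION: every register is pulled toward the mean spurious token.
    spur_target = mean_vec(spurious)
    register = sum(1.0 - cosine(r, spur_target)
                   for r in student_registers) / len(student_registers)
    return semantic, register


# Toy example: three ordinary tokens and one high-norm outlier.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
mask = detect_high_norm(teacher)
sem_loss, reg_loss = dual_objective(
    student_patches=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]],
    student_registers=[[3.0, 3.0]],
    teacher_tokens=teacher,
    spurious_mask=mask,
)
```

In this toy setup both losses are close to zero because the student patches already point in the same direction as their filtered targets and the single register already points toward the outlier. In an actual fine-tuning loop the two terms would be computed on batched tensors and minimized jointly; how they are weighted is not stated in the abstract.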

Problem

Research questions and friction points this paper is trying to address.

spurious tokens

Vision Transformers

spatial representation

dense prediction

representation learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

spurious token

Vision Transformer

contrastive register