Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether zero-ablation overestimates how much DINO vision transformers depend on the content of their register tokens. The authors compare zeroing against three replacement controls (mean substitution, noise substitution, and cross-image register shuffling) across classification, correspondence, and segmentation tasks on DINOv2+registers and DINOv3. Zeroing registers causes drops of up to 36.6 percentage points in classification and 30.9 percentage points in segmentation, whereas all three controls stay within about 1 percentage point of the unmodified baseline. Per-patch cosine similarity confirms that the controls do perturb internal representations, while zeroing perturbs them disproportionately. The results indicate that, in the frozen-feature evaluations tested, performance depends on plausible register-like activations rather than on exact image-specific register values, so zero-ablation overstates dependence on register content. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry, and the findings replicate at ViT-B scale.
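The four interventions compared in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `replace_registers`, the `(batch, num_registers, dim)` layout, the use of batch statistics for the mean and noise controls, and the cyclic shift used for cross-image shuffling are all assumptions made for illustration.

```python
import numpy as np

def replace_registers(registers, mode, rng=None):
    """Return a modified copy of register activations.

    registers: array of shape (batch, num_registers, dim) (assumed layout).
    mode: "zero", "mean", "noise", or "shuffle".
    """
    rng = np.random.default_rng(0) if rng is None else rng
    batch = registers.shape[0]

    if mode == "zero":
        # Zero-ablation: every register becomes the zero vector.
        return np.zeros_like(registers)

    # Batch statistics per register slot and feature dimension (assumption:
    # the source does not say how the statistics are estimated).
    mean = registers.mean(axis=0, keepdims=True)
    std = registers.std(axis=0, keepdims=True)

    if mode == "mean":
        # Mean substitution: each image receives the average register values.
        return np.broadcast_to(mean, registers.shape).copy()
    if mode == "noise":
        # Noise substitution: Gaussian samples matched to mean and std.
        return mean + std * rng.standard_normal(registers.shape)
    if mode == "shuffle":
        # Cross-image shuffling: a non-zero cyclic shift along the batch axis
        # guarantees every image receives another image's registers.
        if batch < 2:
            raise ValueError("shuffling needs at least two images")
        shift = int(rng.integers(1, batch))
        return np.roll(registers, shift, axis=0)
    raise ValueError(f"unknown mode: {mode}")
```

In a real experiment the returned array would be written back into the token sequence at the chosen layer before the remaining blocks run. That hook is model-specific and is left out here.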

📝 Abstract

Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
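The per-patch cosine similarity mentioned in the abstract can be sketched as follows. This is an illustrative NumPy version, not the authors' code. The name `per_patch_cosine`, the `(num_patches, dim)` layout, and the `eps` guard against zero-norm vectors are assumptions.

```python
import numpy as np

def per_patch_cosine(clean, modified, eps=1e-8):
    """Cosine similarity between matching patch features.

    clean, modified: arrays of shape (num_patches, dim) holding patch
    features from the unmodified and the intervened forward pass.
    Returns an array of shape (num_patches,).
    """
    dot = (clean * modified).sum(axis=-1)
    norms = np.linalg.norm(clean, axis=-1) * np.linalg.norm(modified, axis=-1)
    # eps avoids division by zero when a feature vector has zero norm.
    return dot / np.maximum(norms, eps)
```

Values near 1 mean an intervention barely moved the patch features, and lower values mean a larger perturbation. Averaging the result over patches and images gives one number per intervention, which allows the size of the perturbation from zeroing to be compared with that from the replacement controls.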

Problem

Research questions and friction points this paper is trying to address.

zero-ablation

vision transformers

DINO

representation perturbation

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-ablation

vision transformers

representation perturbation

feature dependence