AI Summary
Vision Transformers (ViTs) exhibit limited out-of-distribution (OOD) generalization and anomaly detection performance, primarily due to interference from high-norm tokens. To address this, we propose a lightweight, zero-overhead feature construction method: fusing the CLS token embedding with the mean-pooled embedding of the register tokens. This simple yet effective fusion significantly enhances OOD robustness and anomaly rejection. We provide the first systematic empirical validation that register embeddings consistently improve OOD performance across diverse benchmarks and ViT architectures. Crucially, our method preserves in-distribution (ID) accuracy while simultaneously boosting both OOD generalization and anomaly detection. Extensive experiments on multiple ViT backbones demonstrate consistent improvements: +2–4% OOD top-1 accuracy and a 2–3% reduction in anomaly-detection false positive rates, all without introducing any inference latency overhead.
Abstract
Vision Transformers (ViTs) have shown success across a variety of tasks due to their ability to capture global image representations. Recent studies have identified the existence of high-norm tokens in ViTs, which can interfere with unsupervised object discovery. To address this, the use of "registers", additional tokens that isolate high-norm patch tokens while capturing global image-level information, has been proposed. While registers have been studied extensively for object discovery, their generalization properties, particularly in out-of-distribution (OOD) scenarios, remain underexplored. In this paper, we examine the utility of register token embeddings as additional features for improving generalization and anomaly rejection. To that end, we propose a simple method that combines the special CLS token embedding commonly employed in ViTs with the average-pooled register embeddings to create feature representations that are subsequently used to train a downstream classifier. We find that this enhances OOD generalization and anomaly rejection while maintaining in-distribution (ID) performance. Extensive experiments across multiple ViT backbones trained with and without registers reveal consistent improvements of 2–4% in top-1 OOD accuracy and a 2–3% reduction in false positive rates for anomaly detection. Importantly, these gains are achieved without additional computational overhead.
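The proposed feature construction can be sketched in a few lines. This is a minimal illustration under assumed conventions, not the authors' released code: it assumes the backbone emits tokens in the layout [CLS, register_1..register_R, patch tokens] (as in ViTs trained with registers), and that the fused feature is the concatenation of the CLS embedding with the mean of the register embeddings. The function name and tensor layout are hypothetical.

```python
import torch


def fuse_cls_and_registers(tokens: torch.Tensor, num_registers: int) -> torch.Tensor:
    """Build a fused feature from a ViT's output tokens.

    Assumes token layout [CLS, reg_1..reg_R, patch tokens].

    Args:
        tokens: (batch, seq_len, dim) output token embeddings.
        num_registers: number R of register tokens following the CLS token.

    Returns:
        (batch, 2 * dim) features: CLS concatenated with mean-pooled registers.
    """
    cls_emb = tokens[:, 0]                                 # (B, D) CLS embedding
    reg_emb = tokens[:, 1:1 + num_registers].mean(dim=1)   # (B, D) mean of registers
    return torch.cat([cls_emb, reg_emb], dim=-1)           # (B, 2D) fused feature
```

The fused vectors would then be fed to a downstream linear classifier in place of the CLS embedding alone; since the backbone already computes the register tokens, the fusion adds no extra forward-pass cost.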