ViT$^3$: Unlocking Test-Time Training in Vision

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Test-time training (TTT) for visual sequence modeling lacks systematic design principles, hindering its practical adoption. Method: We propose ViT³, the first purely TTT-native vision architecture, which reformulates attention as an online learning problem—eliminating conventional attention computation and instead constructing a compact internal model solely from key-value pairs, enabling linear time complexity and full parallelization. Contribution/Results: Through extensive empirical analysis, we distill six foundational design principles for visual TTT. ViT³ matches or surpasses Mamba and state-of-the-art linear attention models across image classification, generation, detection, and segmentation, substantially narrowing the performance gap with optimized pretrained ViTs. To our knowledge, this is the first work achieving efficient, end-to-end test-time adaptation for visual sequence modeling without relying on pretrained attention mechanisms—bridging the gap between TTT theory and real-world applicability.

📝 Abstract
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
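The abstract's reformulation of attention as an online learning problem can be illustrated with a minimal sketch. Note that the single linear inner model, squared-error inner loss, fixed learning rate, and sequential update loop below are illustrative assumptions for exposition, not the exact ViT$^3$ design:

```python
import numpy as np

def ttt_layer(queries, keys, values, lr=0.1):
    """Hypothetical sketch of attention-as-online-learning (TTT).

    Instead of computing full O(n^2) attention, maintain a compact inner
    model W updated on the fly: for each token t, take one gradient step
    on the reconstruction loss 0.5 * ||W k_t - v_t||^2, then read out
    W q_t. Total cost is O(n * d * d_v), linear in sequence length n.
    """
    n, d = keys.shape
    d_v = values.shape[1]
    W = np.zeros((d_v, d))              # compact inner model ("fast weights")
    outputs = np.empty((n, d_v))
    for t in range(n):
        k, v, q = keys[t], values[t], queries[t]
        err = W @ k - v                 # inner-loop prediction error
        W -= lr * np.outer(err, k)      # one online gradient step
        outputs[t] = W @ q              # query the adapted inner model
    return outputs
```

This naive loop processes tokens one at a time; the parallelizable computation claimed in the abstract comes from restructuring such inner-loop updates so they can run across the sequence at once, which this sketch does not attempt.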
Problem

Research questions and friction points this paper is trying to address.

Systematically studies test-time training designs for vision tasks
Addresses lack of guidelines for inner module and training choices
Proposes ViT$^3$ model to achieve linear complexity and competitive performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online learning attention reformulation for efficiency
Linear complexity with parallelizable computation design
Systematic empirical study distilling six design principles
Dongchen Han
Tsinghua University
Computer Vision · Deep Learning
Yining Li
Shanghai AI Laboratory
Multimodal Learning · Large Language Model
Tianyu Li
Tsinghua University
Zixuan Cao
Tsinghua University
Ziming Wang
Alibaba Group
Jun Song
Shenzhen University
Nanophotonics
Yu Cheng
Alibaba Group
Bo Zheng
Alibaba Group
Gao Huang
Tsinghua University