TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generalizing video reasoning models to new domains, where current approaches typically rely on large-scale annotated data and multi-stage training. The study introduces test-time reinforcement learning to video-language tasks for the first time, proposing a label-free adaptation approach that operates effectively even on a single sample or a single batch. The method reasons progressively over subsets of video frames, combining a batch-aware frequency-based reward mechanism with a multi-armed bandit strategy for keyframe selection. This enables robust cross-dataset generalization without access to ground-truth labels during testing. Empirical results show consistent gains over state-of-the-art models that depend on extensive training data, improving both adaptation efficiency and accuracy at test time across diverse video reasoning benchmarks.
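The batch-aware frequency-based reward described above can be sketched as a simple majority vote: answers produced from different frame subsets are compared, and agreement with the most frequent answer serves as the pseudo-reward. This is an illustrative sketch only; the function name and the exact reward formulation are assumptions, not the paper's implementation.

```python
from collections import Counter

def frequency_reward(answers):
    """Majority-vote pseudo-reward (hypothetical sketch).

    Each answer was produced by reasoning over a different subset of
    video frames. An answer gets reward 1.0 if it matches the most
    frequent answer across subsets, else 0.0 -- no ground-truth
    labels are needed.
    """
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]
```

For example, if four frame subsets yield answers `["B", "B", "A", "B"]`, the rewards are `[1.0, 1.0, 0.0, 1.0]`, and those rewards can then drive the test-time policy update.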
📝 Abstract
Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video (TTA-Vid) combines two components that operate simultaneously: (1) a test-time adaptation procedure that performs step-by-step reasoning at inference time over multiple frame subsets, using a batch-aware frequency-based reward computed across the subsets as a pseudo ground truth for model updates; and (2) a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. We show that a model adapted on a single batch, or even a single sample, generalizes at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and outperforms current state-of-the-art methods trained on large-scale data, highlighting the potential of test-time reinforcement learning for temporal multimodal understanding.
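The adaptive frame selection described in the abstract can be illustrated with a standard UCB1-style bandit, where each candidate frame is an arm and the frequency-based pseudo-reward is the bandit feedback. The class name, the UCB1 variant, and the way reward is credited to every frame in a chosen subset are all assumptions for illustration; the paper's exact bandit formulation may differ.

```python
import math

class FrameBandit:
    """UCB1-style bandit over candidate frames (hypothetical sketch).

    Each arm is a frame index; the reward fed back is the same
    frequency-based pseudo-reward used for the test-time model update.
    """

    def __init__(self, n_frames, c=1.0):
        self.counts = [0] * n_frames    # times each frame was selected
        self.values = [0.0] * n_frames  # running mean reward per frame
        self.c = c                      # exploration strength
        self.t = 0                      # selection rounds so far

    def select(self, k):
        """Return the k frames with the highest UCB score."""
        self.t += 1
        scores = []
        for i in range(len(self.counts)):
            if self.counts[i] == 0:
                scores.append((float("inf"), i))  # explore unseen frames first
            else:
                bonus = self.c * math.sqrt(math.log(self.t) / self.counts[i])
                scores.append((self.values[i] + bonus, i))
        scores.sort(reverse=True)
        return [i for _, i in scores[:k]]

    def update(self, frames, reward):
        """Credit the shared subset reward to every selected frame."""
        for i in frames:
            self.counts[i] += 1
            self.values[i] += (reward - self.values[i]) / self.counts[i]
```

A typical loop would call `select(k)` to pick a frame subset, reason over those frames, score the answer with the majority-vote pseudo-reward, and feed that value back via `update`, so informative frames accumulate higher scores over time.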
Problem

Research questions and friction points this paper is trying to address.

video reasoning
test-time adaptation
temporal understanding
multimodal learning
domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Adaptation
Reinforcement Learning
Video Reasoning
Frame Selection
Multimodal Understanding