🤖 AI Summary
This work addresses the problem that chain-of-thought (CoT) reasoning in video understanding incurs high computational cost while yielding limited gains, proposing an "on-demand reasoning" framework. During training, the model adopts a "think once, answer twice" strategy, simultaneously learning to produce direct answers and refined responses via CoT. At inference time, it dynamically decides whether to invoke CoT based on the confidence of its initial prediction. This approach is the first to demonstrate that, in reinforcement learning–based video models, direct answering can match or even surpass CoT performance. Efficient reasoning control is achieved through a confidence-driven two-stage supervision and reward mechanism. Experiments show state-of-the-art results across multiple video question-answering and localization benchmarks, with an average 3.3× reduction in response length (e.g., from 149 to 44 tokens), significantly improving the trade-off between efficiency and accuracy.
📝 Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
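The reason-when-necessary inference loop described above can be sketched as a simple confidence gate. The sketch below is illustrative only: the paper does not specify an API, and the `predict_direct`/`predict_with_cot` callables, the `Prediction` container, and the threshold value of 0.8 are all assumptions standing in for the model's two inference paths and its confidence score.

```python
# Minimal sketch of confidence-gated inference (reason-when-necessary).
# All names here are hypothetical; the paper only specifies the decision
# rule: keep the direct answer when its confidence is high, otherwise
# fall back to the more expensive CoT path and use the reviewed answer.
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    confidence: float  # e.g., probability assigned to the answer tokens

def answer_with_optional_cot(predict_direct, predict_with_cot, question,
                             threshold=0.8):
    """Return (answer, thinking_activated). The direct answer is kept if
    its confidence clears the threshold; otherwise the CoT path runs and
    its reviewed answer is returned instead."""
    initial = predict_direct(question)
    if initial.confidence >= threshold:
        return initial.answer, False  # thinking mode not activated
    reviewed = predict_with_cot(question)
    return reviewed.answer, True      # thinking mode activated

# Toy stand-ins for the two inference paths.
def direct(q):
    return Prediction("cat", 0.95 if "easy" in q else 0.4)

def cot(q):
    return Prediction("dog", 0.9)

print(answer_with_optional_cot(direct, cot, "easy perception question"))
# ('cat', False)
print(answer_with_optional_cot(direct, cot, "hard reasoning question"))
# ('dog', True)
```

This gate is also consistent with the reported behavior that thinking mode activates rarely on perception-oriented questions (where the initial answer tends to be confident) and more often on reasoning-intensive ones.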