VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses a mismatch in video understanding: chain-of-thought (CoT) reasoning incurs high computational cost yet often yields limited gains over direct answering. The authors propose an "on-demand reasoning" framework. During training, the model follows a "think once, answer twice" strategy, learning in a single rollout to produce both a direct answer and a CoT-refined answer. At inference time, it dynamically decides whether to invoke CoT based on the confidence of its initial prediction. This approach is the first to demonstrate that, in reinforcement learning–trained video models, direct answering can match or even surpass CoT performance. Efficient reasoning control is achieved through a confidence-driven two-stage supervision and reward mechanism. Experiments show state-of-the-art results across multiple video question-answering and grounding benchmarks, with an average 3.3× reduction in response length (e.g., from 149 to 44 tokens), markedly improving the efficiency–accuracy trade-off.
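To make the decision rule concrete, below is a minimal Python sketch of confidence-gated inference. The `model.generate` interface, the `mode`/`draft` parameters, the mean-token-probability confidence proxy, and the threshold `tau` are all illustrative assumptions, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Generation:
    text: str
    token_logprobs: List[float]  # log-probability of each generated token

def confidence(gen: Generation) -> float:
    # One plausible confidence proxy: exp of the mean token log-probability,
    # i.e., the geometric-mean probability of the generated answer.
    return math.exp(sum(gen.token_logprobs) / max(len(gen.token_logprobs), 1))

def answer(model, video, question, tau: float = 0.8) -> str:
    # Pass 1: direct answer with no explicit reasoning.
    direct = model.generate(video, question, mode="direct")
    # Confident enough: return the direct answer and skip CoT entirely.
    if confidence(direct) >= tau:
        return direct.text
    # Otherwise: think step by step, then emit a reviewed answer.
    reviewed = model.generate(video, question, mode="cot", draft=direct.text)
    return reviewed.text
```

Under such a gate, questions the model already answers confidently stop after the cheap direct pass, while harder inputs fall through to the CoT pass, which is consistent with the activation pattern the abstract reports (low on perception tasks, higher on reasoning-intensive tasks).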

📝 Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
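As a rough illustration of the training signal described above, the sketch below scores one Thinking Once, Answering Twice rollout with verifiable rewards on both the initial and the reviewed answer. The exact-match verifier and the equal weights are assumptions for illustration; the paper's actual reward design (and, e.g., IoU-style verifiers for grounding) may differ.

```python
def verify(answer: str, gold: str) -> float:
    # Exact-match verifier (illustrative); grounding tasks would
    # typically use an overlap metric such as temporal IoU instead.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def rollout_reward(initial: str, reviewed: str, gold: str,
                   w_init: float = 0.5, w_rev: float = 0.5) -> float:
    # Both answers in the same rollout earn their own verifiable reward,
    # so the policy is pushed to be correct before *and* after reasoning.
    return w_init * verify(initial, gold) + w_rev * verify(reviewed, gold)

# Example: the initial answer is wrong but the reviewed answer recovers.
assert rollout_reward("cat", "dog", "dog") == 0.5
```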
Problem

Research questions and friction points this paper is trying to address.

video understanding
chain-of-thought reasoning
computational efficiency
multimodal large language models
reasoning necessity
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoAuto-R1
reason-when-necessary
Thinking Once, Answering Twice
adaptive reasoning
multimodal reasoning
🔎 Related Researchers
Shuming Liu (Meta AI)
Mingchen Zhuge (KAUST AI): Multimodal LLM, AI Agents, Code Generation
Changsheng Zhao (Meta AI): Machine Learning, Natural Language Processing
Jun Chen (Research Scientist, Meta AI): Multi-modal Learning
Lemeng Wu (Meta AI)
Zechun Liu (Meta AI): Computer Vision
Chenchen Zhu (Research Scientist, Meta Reality Labs): Computer Vision, Deep Learning, Perception
Zhipeng Cai (Senior Researcher, Meta): Perception, Multi-modal Generation, Optimization
Chong Zhou (Meta AI)
Haozhe Liu (KAUST): Computer Vision, Reinforcement Learning, Multimodal, Image/Video Generation
Ernie Chang (Research Scientist, Meta AI): Natural Language Processing, Data Efficiency, Multilingual, Multimodal
Saksham Suri (Research Scientist, Meta Reality Labs): Computer Vision, Machine Learning, Deep Learning
Hongyu Xu (Research Scientist, Meta Reality Labs): Spatial Perception, GenAI, Multimodal, RoomPlan
Qingyang Qian (Meta AI)
Wei Wen (Research Scientist, AI at Meta): Deep Learning, Artificial Intelligence, Computer Vision
Bala Varadarajan (Meta AI)
Zhuang Liu (Assistant Professor, Princeton University): Deep Learning, Computer Vision, Machine Learning
Hu Xu (Meta AI, FAIR Labs): Efficient Pre-training, Meta Learning, Multi-modal Learning
Florian Bordes (Meta, FAIR): Responsible AI, Deep Learning, Artificial Intelligence
Raghuraman Krishnamoorthi (Meta AI)
Bernard Ghanem (Professor, King Abdullah University of Science and Technology): Computer Vision, Machine Learning
Vikas Chandra (Meta): AI Research
Yunyang Xiong (University of Wisconsin-Madison): Computer Vision, Machine Learning, Deep Learning