Video Reasoning without Training

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video reasoning typically relies on reinforcement learning (RL) and long chain-of-thought prompting, incurring high computational overhead and offering no controllable, real-time adjustment of the reasoning process. To address this, we propose a training-free, inference-time control method. We first identify a dynamic entropy pattern in the outputs of large multimodal models (LMMs) during video understanding; leveraging this insight, we design an entropy-based objective and a small trainable controller that adapts the model's value cache through a few optimization steps at inference time, tuning micro-exploration and micro-exploitation on the fly. Crucially, our approach entirely avoids RL and supervised fine-tuning. On multiple video reasoning benchmarks, it achieves accuracy competitive with RL-trained models while reducing output token count by 58.6%, significantly improving both efficiency and stability. Our core contribution is establishing the first entropy-driven, training-free paradigm for dynamic control of the video reasoning process.

📝 Abstract
Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
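The entropy signal the abstract builds on is simply the Shannon entropy of the model's next-token distribution, tracked per generated token. A minimal sketch of that quantity (plain NumPy; the function name and toy logits are illustrative, not from the paper):

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over raw logits."""
    z = logits - logits.max()                 # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# A flat distribution (model unsure, "micro-exploration") has high entropy;
# a peaked one (model committing, "micro-exploitation") has low entropy.
flat = token_entropy(np.zeros(8))                      # uniform over 8 tokens
peaked = token_entropy(np.array([10.0] + [0.0] * 7))   # mass on one token
```

Plotting this value over the generated sequence is what would reveal the alternating exploration/exploitation phases and the final low-entropy convergence that the abstract describes in accurate models.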
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in video reasoning models
Improves reasoning control without reinforcement learning
Optimizes inference efficiency via entropy-based adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses entropy to guide model reasoning process
Optimizes value cache via small controller during inference
Improves exploration-exploitation without training or RL
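The inference-time idea behind these bullets can be caricatured as: optimize a tiny controller against an entropy-based objective computed from the model's own logits, with no labels or reward. Everything below is a simplifying assumption, not the actual V-Reason method: a single scalar stands in for the controller, a fixed linear readout stands in for the LMM head over the value cache, and plain entropy minimization stands in for the paper's full objective (modeling only the "final exploitation" phase):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(logits):
    """Shannon entropy (nats) of the softmax over raw logits."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical stand-ins: a cached value vector and a fixed readout that
# maps the (controller-scaled) cache to next-token logits.
value_cache = rng.normal(size=16)
readout = rng.normal(size=(8, 16))

def logits_for(scale):
    return readout @ (scale * value_cache)

# Self-supervised objective: reduce output entropy, driven only by the
# model's own distribution -- no dataset, no RL.
def objective(scale):
    return entropy(logits_for(scale))

# A few finite-difference gradient steps on the scalar "controller".
scale, lr, eps = 1.0, 0.05, 1e-4
h0 = objective(scale)
for _ in range(50):
    grad = (objective(scale + eps) - objective(scale - eps)) / (2 * eps)
    scale -= lr * grad
h1 = objective(scale)  # entropy after adaptation: lower than h0
```

The design point this illustrates: because the objective is a function of the model's own outputs, the controller can be updated during generation with a handful of gradient steps, which is why no training phase is needed.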