CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning

📅 2025-05-17
🤖 AI Summary
Complex video reasoning faces bottlenecks in requiring large-scale annotated data and strong visual perception capabilities. Method: This paper proposes a training-free, multi-stage explicit reasoning paradigm enabling System-2–like structured inference for videos. We introduce the first training-free chain-of-thought (CoT) architecture for video understanding, integrating dynamic CoT path routing, hierarchical question decomposition, and frame-level self-consistency verification, alongside a novel taxonomy for video question classification. Contribution/Results: Our approach decouples reasoning from reliance on end-to-end visual perception, shifting instead to interpretable and verifiable symbolic reasoning paths. Evaluated on EgoSchema and VideoEspresso, it achieves absolute improvements of +9.3% and +5.6%, respectively—matching or surpassing state-of-the-art multimodal foundation models including GPT-4V, GPT-4o, and Gemini 1.5 Flash.
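The dynamic CoT path routing described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the paper's actual question taxonomy and routing logic are not public, so `classify_question` is a toy keyword heuristic standing in for the paper's classifier, and the two path callables are placeholders.

```python
def classify_question(question: str) -> str:
    # Toy stand-in for the paper's question taxonomy (hypothetical):
    # cue words suggesting multi-step reasoning route to the CoT path.
    reasoning_cues = ("why", "how", "what happens after", "explain")
    q = question.lower()
    return "reasoning" if any(cue in q for cue in reasoning_cues) else "perception"

def route_inference_path(question, direct_answer, decomposed_cot):
    """Dynamic CoT path routing (sketch): simple perceptual questions are
    answered directly; complex ones go through a multi-stage decomposed
    chain-of-thought pipeline."""
    if classify_question(question) == "perception":
        return direct_answer(question)
    return decomposed_cot(question)

# Usage with placeholder path functions:
direct = lambda q: "direct-path answer"
cot = lambda q: "multi-stage CoT answer"
print(route_inference_path("What color is the car?", direct, cot))
print(route_inference_path("Why did the person open the fridge?", direct, cot))
```

The design point is that routing is training-free: the classifier gates which explicit reasoning procedure runs, rather than fine-tuning the underlying video LLM.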

📝 Abstract
System-2 reasoning is developing rapidly with the emergence of deep-thinking models and chain-of-thought technology, and has become a central discussion point in the AI community. However, research on complex video reasoning remains comparatively sparse. In this work, we propose CoT-Vid, a novel training-free paradigm for the video domain with a multistage complex reasoning design. Unlike existing video LLMs, which rely heavily on perceptual abilities, CoT-Vid achieves surprising performance gains through an explicit reasoning mechanism. The paradigm consists of three main components: dynamic inference path routing, a problem decoupling strategy, and video self-consistency verification. In addition, we propose a new standard for categorizing video questions. CoT-Vid shows outstanding results on a wide range of benchmarks, outperforming its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models such as GPT-4V, GPT-4o, and Gemini 1.5 Flash. Our codebase will be publicly available soon.
Problem

Research questions and friction points this paper is trying to address.

Develops training-free video reasoning with dynamic chain-of-thought routing
Addresses complex video reasoning gaps via self-verification mechanisms
Outperforms existing models on benchmarks without heavy perceptual reliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic inference path routing for flexible reasoning
Problem decoupling strategy for complex tasks
Video self-consistency verification for accuracy
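The self-consistency verification component in the list above can be sketched as a majority vote over candidate answers from independently sampled reasoning paths. This is a generic self-consistency sketch, not the paper's exact frame-level procedure, which is not detailed here.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the majority answer across independently sampled reasoning
    paths; ties break toward the answer seen first (Counter ordering)."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Example: five reasoning paths sampled for one video question
candidates = ["B", "B", "A", "B", "C"]
print(self_consistency_vote(candidates))  # → B
```

Agreement across paths acts as a verification signal: answers that survive the vote are treated as more reliable than any single chain's output.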
Hongbo Jin (Peking University)
Ruyang Liu (School of Electronic and Computer Engineering, Peking University)
Wenhao Zhang (School of Electronic and Computer Engineering, Peking University)
Guibo Luo (Peking University)
Ge Li (Full Professor of Computer Science, Peking University)