Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing “textual thinking” and “visual thinking” paradigms are constrained by image staticity and modality disjunction, hindering dynamic, coherent multimodal reasoning. To address this, we propose the novel “video thinking” paradigm, which for the first time formalizes video generation as a unified cross-modal reasoning process—enabling spatiotemporally consistent understanding and synthesis. We introduce VideoThinkBench, the first benchmark explicitly designed for video-based reasoning, and enhance state-of-the-art video generation models (e.g., Sora-2) with self-consistency mechanisms and in-context learning to improve reasoning robustness. Experiments demonstrate that our approach matches or surpasses leading vision-language models on vision-dominant tasks, achieving 92.0% accuracy on MATH and 75.53% on MMMU. These results empirically validate video as an effective and generalizable medium for unified multimodal reasoning.

📝 Abstract
The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision remain separate modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles) and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities, and we find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning "thinking with video" as a unified multimodal reasoning paradigm.
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of static images in representing dynamic processes
Bridging visual and textual reasoning in a unified temporal framework
Establishing video generation as a unified multimodal reasoning paradigm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video generation models bridge visual and textual reasoning
Sora-2 serves as a capable multimodal reasoning model
Unified temporal framework handles dynamic processes and changes
Jingqi Tong
Fudan University
Yurong Mou
Fudan University
Hangcheng Li
Fudan University
Mingzhe Li
Shanghai Innovation Institute
Yongzhuo Yang
Fudan University
Ming Zhang
Fudan University
Qiguang Chen
Harbin Institute of Technology
Chain-of-Thought Reasoning, Multilingual LLM, Multi-modal LLM
Tianyi Liang
PhD, East China Normal University; Shanghai AI Lab; Shanghai Innovation Institute
Multimodal Learning, LLMs, Image Editing
Xiaomeng Hu
The Chinese University of Hong Kong
Y. Zheng
Fudan University
Xinchi Chen
Professor at Fudan University, Shanghai, China
Large Language Models, Embodied AI, Natural Language Processing, Information Retrieval, etc.
Jun Zhao
Fudan University
Xuanjing Huang
Fudan University
Xipeng Qiu
Shanghai Innovation Institute