VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of weak video reasoning capabilities, scarcity of high-quality reasoning data, and absence of effective training paradigms, this work introduces two novel video reasoning benchmarks: DarkEventInfer (event-masking inference) and MixVidQA (cross-clip interference question answering). It pioneers the extension of the Reason-Then-Respond paradigm to general-purpose multimodal video reasoning, supporting multiple-choice and open-ended QA as well as video captioning. Methodologically, we propose spatiotemporal feature disentangled encoding, context-aware event completion, and a multi-stage reinforcement learning fine-tuning strategy guided by diversity-based rewards. Our approach achieves comprehensive performance gains across three major evaluation categories—video understanding, cognitive reasoning, and captioning—outperforming all existing methods and establishing new state-of-the-art results on multiple metrics.

📝 Abstract
Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model's advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering video general understanding, cognitive reasoning, and captioning tasks.
Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality video reasoning data and methods
Need for advanced video understanding and reasoning models
Underdeveloped video-based reasoning in multimodal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces DarkEventInfer for masked event inference
Develops MixVidQA for interleaved video reasoning
Uses reinforcement learning with diverse rewards
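The two dataset-construction ideas above can be illustrated with a toy sketch. This is not the paper's implementation: videos are represented here as lists of frame labels rather than real video frames, and all function names and the mask token are illustrative assumptions.

```python
# Hypothetical sketch of the two data-construction ideas:
# event masking (DarkEventInfer-style) and clip interleaving (MixVidQA-style).
# Frames are plain string labels; the actual paper operates on real videos.
import random


def mask_event(frames, start, end, mask_token="<MASKED>"):
    """DarkEventInfer-style sample: hide one event segment.

    The model must infer the obscured content from the surrounding
    contextual frames.
    """
    return [mask_token if start <= i < end else f
            for i, f in enumerate(frames)]


def interleave_clips(clip_a, clip_b, seed=0):
    """MixVidQA-style sample: interleave two distinct clips.

    Questions then target only one clip, so the model must isolate it
    while disregarding the other. Relative frame order within each
    clip is preserved.
    """
    rng = random.Random(seed)
    merged, sources = [], []
    ia = ib = 0
    while ia < len(clip_a) or ib < len(clip_b):
        take_a = ib >= len(clip_b) or (ia < len(clip_a) and rng.random() < 0.5)
        if take_a:
            merged.append(clip_a[ia]); sources.append("A"); ia += 1
        else:
            merged.append(clip_b[ib]); sources.append("B"); ib += 1
    return merged, sources


clip_a = [f"A{i}" for i in range(4)]
clip_b = [f"B{i}" for i in range(4)]
masked = mask_event(clip_a, 1, 3)        # A1 and A2 hidden
mixed, sources = interleave_clips(clip_a, clip_b)
```

In a real pipeline the masked span would cover a semantically coherent event and the two interleaved clips would be visually distinct, but the structure of the training signal is the same: recover hidden content from context, or attend selectively to one of two entangled sources.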
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30
2024-08-08 · International Journal of Computer Vision · Citations: 13
Xinlong Chen
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences
Yuanxing Zhang
Kuaishou Technology
Recommender System · Large Language Model · Video Understanding
Yushuo Guan
Peking University
VLM · Diffusion Model
Bohan Zeng
PhD student, Peking University
Data-Centric AI · Computer Vision · Diffusion Model · 3D
Yang Shi
Peking University
Sihan Yang
Xi’an Jiaotong University
Medical image analysis · Multimodal large language model
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models · Computer Vision · Multimodal AI · Computer Graphics
Qiang Liu
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences
Liang Wang
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences
Tieniu Tan
Institute of Automation, Chinese Academy of Sciences