🤖 AI Summary
In text-intensive video question answering, static single-frame perception causes models to miss fine-grained textual cues and to hallucinate. Method: We propose Visual Rumination, a cognitive-inspired mechanism that iteratively selects frames, zooms into salient regions, and re-encodes local features, mimicking human visual scrutiny. To support this, we construct the first executable rumination trajectory dataset and design a multi-stage training framework for a 7B-scale multimodal large language model (MLLM), jointly optimizing atomic visual operations (e.g., zoom, jump) and hybrid visual actions via supervised fine-tuning (SFT) and GRPO-based reinforcement learning. Contribution/Results: Our method achieves state-of-the-art performance on M4-ViteVQA and generalizes to multi-page documents, slides, and general video QA tasks, establishing a new paradigm for fine-grained multimodal reasoning.
📝 Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding the retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised fine-tuning and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixed visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
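The select-zoom-re-encode-update loop described above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: the function names (`select_frame`, `zoom`, `ruminate`), the dict-based "frame" representation, and the stopping rule are all assumptions made for the sake of a runnable toy example; the real model issues these operations as learned visual actions over actual pixels.

```python
# Toy sketch of a visual-rumination loop (illustrative assumptions throughout;
# names and data structures are hypothetical, not the Video-R4 API).
from dataclasses import dataclass, field


@dataclass
class RuminationState:
    """Reasoning state accumulated across rumination steps."""
    evidence: list = field(default_factory=list)


def select_frame(frames, state):
    # Assumption: inspect the first frame not yet visited.
    seen = {e["frame"] for e in state.evidence}
    for i in range(len(frames)):
        if i not in seen:
            return i
    return None  # nothing left to inspect


def zoom(frame, region):
    # Assumption: a "frame" is a dict mapping region names to readable text,
    # standing in for cropping and re-encoding a pixel region.
    return frame.get(region, "")


def ruminate(frames, region, max_steps=4):
    """Iteratively select a frame, zoom into a region, and update the state."""
    state = RuminationState()
    for _ in range(max_steps):
        idx = select_frame(frames, state)
        if idx is None:
            break
        text = zoom(frames[idx], region)  # "re-encode" the zoomed content
        state.evidence.append({"frame": idx, "text": text})
    # Final answer: the textual evidence gathered across steps.
    return " ".join(e["text"] for e in state.evidence if e["text"])


# Toy "video": each frame maps named regions to OCR-style text.
video = [
    {"sign": "EXIT", "caption": "lobby"},
    {"sign": "GATE 4", "caption": "terminal"},
]
print(ruminate(video, "sign"))
```

The point of the sketch is the control flow: evidence is gathered over multiple passes rather than from one static encoding, which is what distinguishes rumination from single-pass perception.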