🤖 AI Summary
In text-intensive video question answering, static single-frame perception causes models to miss fine-grained textual cues and to hallucinate. Method: We propose Visual Rumination, a cognitive-inspired mechanism that iteratively selects frames, zooms into salient regions, and re-encodes local features, mimicking human visual scrutiny. To support this, we construct the first executable rumination trajectory dataset and design a multi-stage training framework for a 7B-scale multimodal large language model (MLLM), jointly optimizing atomic visual operations (e.g., zoom, jump) and hybrid visual actions via supervised fine-tuning (SFT) and GRPO-based reinforcement learning. Contribution/Results: Our method achieves state-of-the-art performance on M4-ViteVQA and generalizes to multi-page documents, slides, and general video QA tasks, establishing a new paradigm for fine-grained multimodal reasoning.
📝 Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding the retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised fine-tuning and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixed visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
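The select-zoom-re-encode-update loop described above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: the function names (`select_frame`, `zoom`, `ruminate`), the dict-based "frame" representation, and the stopping rule are all assumptions made for the sake of a runnable toy example; the real model issues these operations as learned visual actions over actual pixels.

```python
# Toy sketch of a visual-rumination loop (illustrative assumptions throughout;
# names and data structures are hypothetical, not the Video-R4 API).
from dataclasses import dataclass, field


@dataclass
class RuminationState:
    """Reasoning state accumulated across rumination steps."""
    evidence: list = field(default_factory=list)


def select_frame(frames, state):
    # Assumption: inspect the first frame not yet visited.
    seen = {e["frame"] for e in state.evidence}
    for i in range(len(frames)):
        if i not in seen:
            return i
    return None  # nothing left to inspect


def zoom(frame, region):
    # Assumption: a "frame" is a dict mapping region names to readable text,
    # standing in for cropping and re-encoding a pixel region.
    return frame.get(region, "")


def ruminate(frames, region, max_steps=4):
    """Iteratively select a frame, zoom into a region, and update the state."""
    state = RuminationState()
    for _ in range(max_steps):
        idx = select_frame(frames, state)
        if idx is None:
            break
        text = zoom(frames[idx], region)  # "re-encode" the zoomed content
        state.evidence.append({"frame": idx, "text": text})
    # Final answer: the textual evidence gathered across steps.
    return " ".join(e["text"] for e in state.evidence if e["text"])


# Toy "video": each frame maps named regions to OCR-style text.
video = [
    {"sign": "EXIT", "caption": "lobby"},
    {"sign": "GATE 4", "caption": "terminal"},
]
print(ruminate(video, "sign"))
```

The point of the sketch is the control flow: evidence is gathered over multiple passes rather than from one static encoding, which is what distinguishes rumination from single-pass perception.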