VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

📅 2025-10-12
📝 Abstract
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
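The abstract's core mechanism — a reward model that actively pulls frames into a bounded visual memory window via operations like `select frame`, rather than packing all frames into the initial prompt — can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the `model.step` API, the operation names, and the coarse preview seeding are all assumptions made for the example.

```python
from collections import deque

def vr_thinker_judge(video_frames, prompt, model, window_size=8, max_steps=16):
    """Hypothetical sketch of a thinking-with-image reward loop.

    The model starts from a sparse frame preview and may issue a
    'select_frame' operation to pull a full-detail frame into a
    bounded visual memory window, instead of packing every frame
    into the initial context.
    """
    # Bounded visual memory: the oldest frame is evicted when full.
    memory = deque(maxlen=window_size)
    # Seed with a coarse, evenly spaced preview of the video.
    stride = max(1, len(video_frames) // window_size)
    memory.extend(video_frames[::stride][:window_size])

    trace = []
    for _ in range(max_steps):
        action = model.step(prompt, list(memory), trace)  # assumed API
        trace.append(action)
        if action["op"] == "select_frame":
            # Fetch the requested frame; the window evicts stale evidence.
            memory.append(video_frames[action["index"]])
        elif action["op"] == "final_judgment":
            return action["scores"]  # per-dimension + overall judgments
    return None  # no judgment reached within the step budget
```

The fixed-size `deque` stands in for the paper's configurable memory window: it keeps the visual context within a budget while letting the model refresh its evidence mid-reasoning.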
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in multimodal reward models for video generation.
Enhances visual reasoning fidelity within constrained context budgets.
Reduces hallucination and forgetting during chain-of-thought reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a thinking-with-image framework with visual reasoning operations.
Uses a reinforcement fine-tuning pipeline to enhance reasoning skills.
Implements a configurable visual memory window for active evidence management.
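The final stage of the pipeline applies GRPO, which scores each sampled reasoning trace relative to the other rollouts for the same prompt rather than against a learned value function. A minimal sketch of that group-relative advantage computation (function name and `eps` smoothing are illustrative assumptions, not from the paper):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: normalize each rollout's reward by
    the mean and standard deviation of its group (all rollouts
    sampled for the same prompt)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps avoids division by zero when all rewards in a group are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Traces with above-group-average judgments get positive advantages and are reinforced; below-average traces are suppressed, which is how GRPO strengthens reasoning without a separate critic model.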