Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses critical limitations in existing video reward models, which either suffer from shortcut learning in discriminative approaches or exhibit training instability due to the tight coupling of reasoning and scoring in generative chain-of-thought methods. To overcome these issues, the authors propose DeScore, the first framework to explicitly decouple chain-of-thought reasoning from reward prediction: a multimodal large language model first generates an explicit reasoning chain, which is then processed by a dedicated discriminative module to predict the reward. DeScore employs a two-stage optimization framework enhanced with learnable query tokens, a regression head, stochastic masking, and a dual-objective reinforcement learning strategy, significantly improving training stability and efficiency while preserving generalization. Experiments demonstrate that DeScore achieves superior alignment with human preferences across diverse video scenarios, outperforming both discriminative and generative reward models.

📝 Abstract

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma. Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. Yet they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
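
The "think-then-score" scoring module described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not say how the query token reads the CoT or what the random mask covers, so the sketch assumes the query token attention-pools the CoT hidden states and that the mask hides random CoT positions during the cold start. The names `random_mask`, `score`, `cot_states`, `query`, `w`, and `b` are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_mask(n_tokens, mask_prob, rng):
    """Assumed masking scheme: hide each CoT position with probability mask_prob.

    Returns a keep-mask (True = visible) with at least one visible position.
    """
    keep = [rng.random() >= mask_prob for _ in range(n_tokens)]
    if not any(keep):
        keep[rng.randrange(n_tokens)] = True
    return keep


def score(cot_states, query, w, b, keep=None):
    """Assumed scoring head: query-token attention pooling + linear regression.

    cot_states: hidden states of the generated CoT tokens (list of vectors).
    query:      learnable query token (vector).
    w, b:       regression-head weights and bias.
    keep:       optional keep-mask from random_mask (training only).
    """
    if keep is None:
        keep = [True] * len(cot_states)
    visible = [h for h, k in zip(cot_states, keep) if k]
    scale = math.sqrt(len(query))
    attn = softmax([dot(query, h) / scale for h in visible])
    pooled = [
        sum(a * h[i] for a, h in zip(attn, visible))
        for i in range(len(query))
    ]
    return dot(w, pooled) + b


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_tokens = 4, 6
    # Stand-in for MLLM hidden states of a generated reasoning chain.
    cot_states = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    w = [rng.gauss(0, 1) for _ in range(dim)]
    keep = random_mask(n_tokens, mask_prob=0.3, rng=rng)
    print("training-time reward:", score(cot_states, query, w, 0.0, keep))
    print("inference-time reward:", score(cot_states, query, w, 0.0))
```

In this sketch the reward comes from a regression over pooled states, not from decoding a score token, which is one way to read the decoupling of scoring from the autoregressive chain. The parameters are random here; in a real system `query`, `w`, and `b` would be learned.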

Problem

Research questions and friction points this paper is trying to address.

video reward modeling

Chain-of-Thought reasoning

decoupled reasoning and scoring

multimodal large language models

human preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Reasoning

Video Reward Modeling

Chain-of-Thought