Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

📅 2026-02-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing visual reward models struggle to simultaneously achieve alignment with human preferences, semantic consistency, and efficient inference in complex image editing tasks. This work proposes a unified training framework based on a shared vision-language backbone, which, for the first time, internalizes chain-of-thought reasoning capabilities within a discriminative reward model. By jointly optimizing discriminative preference learning and generative language modeling, the approach unifies semantic understanding and preference alignment. The method achieves state-of-the-art performance on both MMRB2 and EditReward-Bench benchmarks and significantly enhances the stability and sample efficiency of downstream online reinforcement learning.
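The summary describes the joint objective only at a high level. A standard instantiation of such an objective pairs a Bradley-Terry preference loss with a language-modeling loss on the shared backbone; this is a sketch, not the paper's stated formulation, and the trade-off weight λ and both loss forms below are assumptions:

```latex
% Hypothetical joint objective; \lambda and both loss terms are assumed forms.
% r_\theta : scalar reward head; p_\theta : LM head; both share backbone \theta.
\mathcal{L}_{\mathrm{JRM}}
  = \underbrace{-\,\mathbb{E}\!\left[\log \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right)\right]}_{\text{discriminative preference (Bradley--Terry)}}
  \;+\; \lambda\,\underbrace{\mathbb{E}\!\left[-\log p_\theta(c \mid x)\right]}_{\text{language modeling on rationale } c}
```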

📝 Abstract
Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
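To make the joint optimization concrete, here is a minimal PyTorch-style sketch of one training step, assuming a shared vision-language backbone with two heads; `model.score`, `model.lm_logits`, and `lm_weight` are hypothetical names, and the paper's actual batching and loss weighting may differ:

```python
import torch
import torch.nn.functional as F

def jrm_step(model, batch, lm_weight=0.5):
    """One joint training step: preference learning + language modeling.

    Assumes a shared vision-language backbone with two heads (hypothetical API):
      model.score(images, text)      -> scalar reward per sample
      model.lm_logits(images, toks)  -> next-token logits over the vocabulary
    """
    # Discriminative side: Bradley-Terry loss on (chosen, rejected) pairs.
    r_chosen = model.score(batch["images"], batch["chosen"])
    r_rejected = model.score(batch["images"], batch["rejected"])
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Generative side: next-token cross-entropy on chain-of-thought rationales,
    # which injects reasoning supervision into the shared representations.
    tokens = batch["cot_tokens"]
    logits = model.lm_logits(batch["images"], tokens[:, :-1])
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, vocab)
        tokens[:, 1:].reshape(-1),            # (B*T,)
    )

    # Shared-backbone joint objective.
    return pref_loss + lm_weight * lm_loss
```

On this reading, the generative head is only needed during training; at evaluation time a single forward pass through the scalar head yields the reward, which is where the claimed inference efficiency would come from.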
Problem

Research questions and friction points this paper is trying to address.

reward modeling
visual reward models
semantic consistency
human preference alignment
reasoning supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Reward Modeling
Chain-of-Thought Reasoning
Vision-Language Backbone
Preference Learning
Efficient Reward Models
Yankai Yang
Harbin Institute of Technology, Shenzhen
Yancheng Long
Harbin Institute of Technology, Shenzhen
Hongyang Wei
Tsinghua Shenzhen International Graduate School, Tsinghua University
Wei Chen
HKUST
Computer Vision; Vision-Language
Tianke Zhang
Tsinghua University; Kuaishou Technology
Computer Vision; Natural Language Processing
Kaiyu Jiang
Kuaishou
MLLM
Haonan Fan
Kuaishou Technology
Changyi Liu
Kuaishou Technology
Jiankang Chen
Kuaishou Technology
Kaiyu Tang
Kuaishou Technology
Bin Wen
Kuaishou
MLLM
Fan Yang
Kuaishou Technology
Tingting Gao
Kuaishou Technology
Han Li
Kuaishou Technology
Shuo Yang
Professor, Harbin Institute of Technology (Shenzhen)
Data-Centric AI; Trustworthy AI; Machine Learning; Computer Vision