Unified Personalized Reward Model for Vision Generation

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-based reward models, which rely on a uniform preference assumption and struggle to capture subjective, context-dependent user preferences, leading to systematic deviations from human judgments. To overcome this, the authors propose UnifiedReward-Flex, the first framework to introduce a flexible and interpretable hierarchical evaluation mechanism that dynamically constructs context-adaptive, personalized criteria by integrating semantic intent understanding with visual evidence. It supports joint evaluation over both predefined and self-generated assessment dimensions. The method employs a two-stage training strategy: first, supervised fine-tuning (SFT) on high-quality reasoning trajectories distilled from a closed-source vision-language model, followed by direct preference optimization (DPO) to enhance discriminative alignment. Integrated into the GRPO framework, UnifiedReward-Flex significantly outperforms current reward models on both image and video generation tasks, improving preference alignment and generation quality.
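As a rough sketch of the hierarchical evaluation mechanism summarized above, the snippet below represents high-level dimensions (predefined or self-generated) holding fine-grained criteria, and aggregates them into a scalar reward. All dimension names, weights, and the averaging rule here are illustrative assumptions, not the paper's actual schema; in the real model the criteria and scores would be produced by the VLM's reasoning trace rather than hand-coded.

```python
# Minimal sketch of a hierarchical, context-adaptive assessment.
# Names, weights, and aggregation rule are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    name: str
    score: float  # fine-grained score in [0, 1]

@dataclass
class Dimension:
    name: str
    weight: float     # relative importance inferred from the prompt
    predefined: bool  # False for a self-generated, context-adaptive dimension
    criteria: List[Criterion] = field(default_factory=list)

    def score(self) -> float:
        # Average the fine-grained criteria instantiated under this dimension.
        return sum(c.score for c in self.criteria) / max(len(self.criteria), 1)

def aggregate_reward(dimensions: List[Dimension]) -> float:
    # Weighted mean over predefined and self-generated dimensions alike.
    total = sum(d.weight for d in dimensions)
    return sum(d.weight * d.score() for d in dimensions) / total

# Example: one predefined dimension plus one self-generated,
# prompt-specific dimension instantiated for a watercolor prompt.
dims = [
    Dimension("visual_fidelity", 0.6, True,
              [Criterion("sharpness", 0.8), Criterion("artifact_free", 0.7)]),
    Dimension("prompt_specific_style", 0.4, False,
              [Criterion("watercolor_texture", 0.9)]),
]
print(aggregate_reward(dims))  # -> ~0.81, a scalar reward for downstream RL
```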

📝 Abstract
Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To address this, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
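As a hedged illustration of stage (2), the sketch below implements the standard DPO objective on per-sequence log-probabilities. The tensor shapes, beta value, and variable names are assumptions for illustration; the paper's actual implementation details are not specified in this abstract.

```python
# Standard DPO loss on curated preference pairs (illustrative sketch).
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each tensor holds summed token log-probs of the chosen / rejected
    response under the policy (pi) or the frozen reference (ref) model."""
    logits = beta * ((pi_chosen - pi_rejected) - (ref_chosen - ref_rejected))
    # Push the policy's chosen-vs-rejected margin above the reference's.
    return -F.logsigmoid(logits).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
pi_c = torch.randn(4, requires_grad=True)
pi_r = torch.randn(4, requires_grad=True)
ref_c, ref_r = torch.randn(4), torch.randn(4)
loss = dpo_loss(pi_c, pi_r, ref_c, ref_r)
loss.backward()  # gradients flow only into the policy log-probs
print(float(loss))
```

Likewise, a minimal sketch of the group-relative advantage computation at the core of GRPO, where the scalar rewards would come from the reward model scoring each rollout for the same prompt; the within-group standardization shown is the standard GRPO formulation, not code from the paper.

```python
# Group-relative advantages: standardize rewards within each prompt's group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar per generated sample."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 image/video rollouts each, scored by the reward model.
rewards = torch.tensor([[0.7, 0.4, 0.9, 0.5],
                        [0.2, 0.8, 0.6, 0.3]])
print(grpo_advantages(rewards))  # above-mean samples get positive advantage
```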
Problem

Research questions and friction points this paper is trying to address.

multimodal reward models
personalized preference
vision generation
context-dependent alignment
subjective evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized reward model
context-adaptive reasoning
hierarchical assessment
reasoning distillation
direct preference optimization