🤖 AI Summary
This work addresses the limitations of existing vision-based reward models, which rely on a uniform preference assumption and struggle to capture subjective, context-dependent user preferences, leading to systematic deviations from human judgments. To overcome this, we propose UnifiedReward-Flex, the first framework to introduce a flexible and interpretable hierarchical evaluation mechanism that dynamically constructs context-adaptive, personalized criteria by integrating semantic intent understanding with visual evidence. It supports joint optimization over both predefined and self-generated evaluation dimensions. The method employs a two-stage training strategy: first, supervised fine-tuning (SFT) using high-quality reasoning trajectories distilled from a closed-source vision-language model, followed by direct preference optimization (DPO) to enhance discriminative alignment. Integrated into the GRPO framework, UnifiedReward-Flex significantly outperforms current reward models in both image and video generation tasks, effectively improving preference alignment and generation quality.
📝 Abstract
Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or rely on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To address this, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive experiments demonstrate its superiority.
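The second training stage applies DPO to preference pairs. As a rough illustration only (the paper's actual implementation, hyperparameters, and pair-construction details are not specified here), the standard per-pair DPO objective can be sketched as follows, where the function names and the `beta` value are illustrative assumptions:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l: policy log-probabilities of the chosen (winning)
    and rejected (losing) responses; ref_logp_* are the same quantities
    under the frozen reference model. beta scales the implicit reward.
    """
    # Implicit reward margin between winner and loser
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): shrinks as the policy favors the winner more
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree (zero margin), the loss is log 2; increasing the policy's relative preference for the chosen response drives the loss toward zero, which is what "strengthening discriminative alignment" amounts to in this objective.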