Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL)-based post-training methods for visual generation (e.g., GRPO) rely solely on sample-level scalar rewards, ignoring the intrinsic spatio-temporal structure of images and videos and thus failing to localize and rectify fine-grained artifacts. Method: Visual Preference Policy Optimization (ViPO), a fine-grained RL framework that extends scalar preference feedback into pixel-level advantage maps. Leveraging a pretrained vision backbone, it extracts structured perceptual features that enable spatially and temporally differentiated policy updates. The resulting pixel-level optimization is lightweight and architecture-agnostic, preserving GRPO's training stability while enabling localized correction. Results: Experiments show significant improvements over standard GRPO on both image and video generation tasks: higher human-preference alignment, better fidelity of local details, and stronger cross-domain generalization, without increasing computational overhead or destabilizing training.
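
To make this concrete, here is a minimal sketch of one way scalar group-relative advantages could be lifted into pixel-level advantage maps with a frozen pretrained backbone. The function name `pixel_advantage_map`, the feature-norm saliency proxy, and the mean-1 normalization are all assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: lift a scalar group-relative advantage into a pixel-level
# advantage map. Function names, the feature-norm saliency proxy, and the
# normalization are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def pixel_advantage_map(image, scalar_advantage, backbone):
    """Spread one scalar advantage over pixels, weighted by perceptual saliency.

    image:            (B, 3, H, W) generated sample
    scalar_advantage: (B,) group-relative advantage from GRPO
    backbone:         frozen pretrained vision encoder -> (B, C, h, w) features
    """
    with torch.no_grad():
        feats = backbone(image)                        # structured perceptual features
    # Feature magnitude as a stand-in for perceptual importance (an assumption).
    saliency = feats.norm(dim=1, keepdim=True)         # (B, 1, h, w)
    saliency = F.interpolate(saliency, size=image.shape[-2:],
                             mode="bilinear", align_corners=False)
    # Normalize each map to mean 1 so it redistributes, rather than rescales,
    # the total optimization pressure.
    saliency = saliency / saliency.mean(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
    return scalar_advantage.view(-1, 1, 1, 1) * saliency   # (B, 1, H, W)
```

With the mean-1 normalization, each sample's map averages back to its original scalar advantage, so the spatial redistribution leaves the overall update scale unchanged; that is one plausible reading of how pixel-level advantages can preserve GRPO's stability.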

📝 Abstract
Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
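
Under the same caveat, the sketch below shows how such a map could plug into a GRPO-style clipped surrogate: group-relative normalization of scalar rewards, then a loss that weights each pixel by its own advantage. The assumed per-pixel log-probability shapes and the PPO-style clipping follow common GRPO practice and need not match ViPO's exact objective.

```python
# Hedged sketch of a GRPO-style update with pixel-level advantages.
# Shapes and clipping are assumptions borrowed from standard GRPO/PPO
# practice; ViPO's exact objective may differ.
import torch

def grpo_group_advantages(rewards):
    """rewards: (G,) scalar rewards for a group of samples from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pixel_policy_loss(log_probs, old_log_probs, adv_map, clip_eps=0.2):
    """Clipped surrogate applied per pixel instead of per sample.

    log_probs, old_log_probs: (B, 1, H, W) per-pixel action log-likelihoods
                              under the new/old policies (assumed shapes)
    adv_map:                  (B, 1, H, W) pixel-level advantages
    """
    ratio = (log_probs - old_log_probs).exp()
    unclipped = ratio * adv_map
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_map
    return -torch.min(unclipped, clipped).mean()
```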
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual generation by replacing scalar rewards with structured pixel-level feedback
Addressing localized artifacts and fine-grained perceptual cues in visual content
Improving alignment with human preferences through spatially aware advantage maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-level advantage maps replace scalar rewards
Pretrained vision backbones construct perceptual structure
Architecture-agnostic method maintains GRPO stability
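
Read together with the sketches above, a hypothetical training step would chain these pieces; every name in the snippet (`reward_model`, `backbone`, `images`, `log_probs`, `old_log_probs`) is a stand-in for the surrounding rollout pipeline, not an API from the paper.

```python
# Hypothetical training step combining the two sketches above.
rewards = reward_model(images)                          # (G,) scalar preference scores
adv = grpo_group_advantages(rewards)                    # (G,) group-relative advantages
adv_maps = pixel_advantage_map(images, adv, backbone)   # (G, 1, H, W)
loss = pixel_policy_loss(log_probs, old_log_probs, adv_maps)
loss.backward()
```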
Authors
Ziqi Ni
Southeast University
Computer Vision · Generative AI
Yuanzhi Liang
UTS
Rui Li
Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China
Yi Zhou
Southeast University
Haibing Huang
Institute of Artificial Intelligence (TeleAI), China Telecom
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom