Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL)-based post-training methods for visual generation (e.g., GRPO) rely solely on sample-level scalar rewards, ignoring the intrinsic spatio-temporal structure of images and videos and thus failing to localize and rectify fine-grained artifacts. Method: Visual Preference Policy Optimization (ViPO), a fine-grained RL framework that extends scalar preference feedback into pixel-level advantage maps. Leveraging a pretrained vision backbone, it extracts structured perceptual features that enable spatially and temporally differentiated policy updates. The resulting pixel-level optimization is lightweight and architecture-agnostic, preserving GRPO's training stability while enabling localized correction. Results: Experiments show significant improvements over standard GRPO on both image and video generation tasks: higher human-preference alignment, better fidelity of local details, and stronger cross-domain generalization, without increasing computational overhead or destabilizing training.
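
To make this concrete, here is a minimal sketch of one way scalar group-relative advantages could be lifted into pixel-level advantage maps with a frozen pretrained backbone. The function name `pixel_advantage_map`, the feature-norm saliency proxy, and the mean-1 normalization are all assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: lift a scalar group-relative advantage into a pixel-level
# advantage map. Function names, the feature-norm saliency proxy, and the
# normalization are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def pixel_advantage_map(image, scalar_advantage, backbone):
    """Spread one scalar advantage over pixels, weighted by perceptual saliency.

    image:            (B, 3, H, W) generated sample
    scalar_advantage: (B,) group-relative advantage from GRPO
    backbone:         frozen pretrained vision encoder -> (B, C, h, w) features
    """
    with torch.no_grad():
        feats = backbone(image)                        # structured perceptual features
    # Feature magnitude as a stand-in for perceptual importance (an assumption).
    saliency = feats.norm(dim=1, keepdim=True)         # (B, 1, h, w)
    saliency = F.interpolate(saliency, size=image.shape[-2:],
                             mode="bilinear", align_corners=False)
    # Normalize each map to mean 1 so it redistributes, rather than rescales,
    # the total optimization pressure.
    saliency = saliency / saliency.mean(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
    return scalar_advantage.view(-1, 1, 1, 1) * saliency   # (B, 1, H, W)
```

With the mean-1 normalization, each sample's map averages back to its original scalar advantage, so the spatial redistribution leaves the overall update scale unchanged; that is one plausible reading of how pixel-level advantages can preserve GRPO's stability.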

📝 Abstract
Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
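
Under the same caveat, the sketch below shows how such a map could plug into a GRPO-style clipped surrogate: group-relative normalization of scalar rewards, then a loss that weights each pixel by its own advantage. The assumed per-pixel log-probability shapes and the PPO-style clipping follow common GRPO practice and need not match ViPO's exact objective.

```python
# Hedged sketch of a GRPO-style update with pixel-level advantages.
# Shapes and clipping are assumptions borrowed from standard GRPO/PPO
# practice; ViPO's exact objective may differ.
import torch

def grpo_group_advantages(rewards):
    """rewards: (G,) scalar rewards for a group of samples from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pixel_policy_loss(log_probs, old_log_probs, adv_map, clip_eps=0.2):
    """Clipped surrogate applied per pixel instead of per sample.

    log_probs, old_log_probs: (B, 1, H, W) per-pixel action log-likelihoods
                              under the new/old policies (assumed shapes)
    adv_map:                  (B, 1, H, W) pixel-level advantages
    """
    ratio = (log_probs - old_log_probs).exp()
    unclipped = ratio * adv_map
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_map
    return -torch.min(unclipped, clipped).mean()
```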
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual generation by replacing scalar rewards with structured pixel-level feedback
Addressing localized artifacts and fine-grained perceptual cues in visual content
Improving alignment with human preferences through spatially aware advantage maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-level advantage maps replace scalar rewards
Pretrained vision backbones construct perceptual structure
Architecture-agnostic method maintains GRPO stability
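
Read together with the sketches above, a hypothetical training step would chain these pieces; every name in the snippet (`reward_model`, `backbone`, `images`, `log_probs`, `old_log_probs`) is a stand-in for the surrounding rollout pipeline, not an API from the paper.

```python
# Hypothetical training step combining the two sketches above.
rewards = reward_model(images)                          # (G,) scalar preference scores
adv = grpo_group_advantages(rewards)                    # (G,) group-relative advantages
adv_maps = pixel_advantage_map(images, adv, backbone)   # (G, 1, H, W)
loss = pixel_policy_loss(log_probs, old_log_probs, adv_maps)
loss.backward()
```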
Authors
Ziqi Ni
Southeast University
Computer Vision · Generative AI
Yuanzhi Liang
UTS
Rui Li
Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China
Yi Zhou
Southeast University
Haibing Huang
Institute of Artificial Intelligence (TeleAI), China Telecom
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom