Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment

📅 2025-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address two critical limitations of Group Relative Policy Optimization (GRPO) in flow-matching image generation, namely (i) inaccurate temporal credit assignment due to sparse terminal rewards and (ii) optimization stagnation caused by intra-group relative reward decay, this paper proposes Value-Anchored Group Policy Optimization (VGPO). Methodologically, VGPO introduces *temporally dense value modeling*, which enables fine-grained, process-aware credit assignment across generation timesteps, and *absolute-value-enhanced group normalization*, which keeps the optimization signal stable as intra-group reward diversity declines. Evaluated on three standard benchmarks, VGPO achieves state-of-the-art performance in both image quality (e.g., FID, CLIP Score) and task accuracy, while mitigating reward hacking, improving training stability, and increasing robustness to distributional shifts (e.g., domain or noise-level variations).
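The stagnation in (ii) follows directly from GRPO's group-normalized advantage: once every sample in a group earns the same reward, the centered numerator is zero and the policy receives no gradient. Below is a minimal sketch of that failure mode together with one plausible reading of absolute-value anchoring; the blending form, the `alpha` weight, and the `value_anchor` estimate are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO advantage: each reward is normalized within its group.
    If all rewards in the group are identical, the centered numerator is
    zero and every advantage collapses to zero (no optimization signal)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def value_anchored_advantages(rewards: np.ndarray, value_anchor: float,
                              alpha: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Illustrative absolute-value-anchored variant: blend the relative
    intra-group signal with the deviation from an absolute value estimate,
    so a fully converged group still produces a gradient."""
    relative = (rewards - rewards.mean()) / (rewards.std() + eps)
    absolute = rewards - value_anchor
    return alpha * relative + (1.0 - alpha) * absolute

# A converged group: intra-group reward diversity is fully depleted.
group = np.array([0.9, 0.9, 0.9, 0.9])
print(grpo_advantages(group))                  # [0. 0. 0. 0.] -> stagnation
print(value_anchored_advantages(group, 0.75))  # [0.075 ...] -> signal survives
```

Anchoring against an absolute value estimate is what keeps the gradient informative in the converged regime: a group that beats the anchor is uniformly reinforced, and one that falls short is uniformly penalized.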

📝 Abstract
Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO to flow-matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage refinement. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to optimization stagnation once reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel absolute-value-enhanced process that maintains a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: https://yawen-shao.github.io/VGPO/.
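To make limitation (i) concrete: standard GRPO copies the single terminal reward onto every timestep, whereas VGPO models the expected cumulative reward at each generative stage. The sketch below, assuming a simple MLP value head and a TD-style per-step credit, shows one way such dense, process-aware estimates could replace the uniform sparse reward; `StepValueHead`, its architecture, and the `dense_credit` form are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class StepValueHead(nn.Module):
    """Hypothetical value head: maps an intermediate latent x_t and its
    timestep t to the expected terminal reward of finishing generation
    from that state (a dense, process-aware value estimate)."""
    def __init__(self, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (B, latent_dim) flattened latent; t: (B,) time in [0, 1]
        return self.net(torch.cat([x_t, t[:, None]], dim=-1)).squeeze(-1)

def dense_credit(values: torch.Tensor, terminal_reward: torch.Tensor) -> torch.Tensor:
    """Per-step credit along one trajectory: the credit for step t is the
    change in expected reward it produced, V(x_{t+1}) - V(x_t), with the
    last step bootstrapped against the actual terminal reward."""
    next_values = torch.cat([values[1:], terminal_reward.reshape(1)])
    return next_values - values

# Toy usage: one trajectory with 4 generation steps.
head = StepValueHead(latent_dim=8)
x = torch.randn(4, 8)                   # intermediate latents at steps 0..3
t = torch.linspace(0.0, 1.0, 4)         # flow-matching time grid
values = head(x, t)                     # (4,) expected terminal rewards
print(dense_credit(values, torch.tensor(0.9)))
```

Because credit is assigned per step, early structure-forming steps and late refinement steps can receive different learning signals instead of a single uniformly broadcast reward.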
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal credit assignment in flow matching models
Mitigates optimization stagnation from depleted reward diversity
Enhances image generation quality and task-specific accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense process-aware value estimates for credit assignment
Absolute value-enhanced group normalization for stable optimization
Temporal and group dimensions redefined for flow matching
🔎 Similar Papers
No similar papers found.
👥 Authors
Yawen Shao
University of Science and Technology of China
Jie Xiao
University of Science and Technology of China
low-level vision, generative model, machine learning
Kai Zhu
University of Science and Technology of China
Yu Liu
Tongyi Lab
Wei Zhai
University of Science and Technology of China
Yang Cao
University of Science and Technology of China
Zheng-Jun Zha
University of Science and Technology of China