Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing GRPO methods in text-to-image generation, which rely on global rewards and thus fail to discern the local contributions of individual denoising steps or capture long-term dependencies within the generation trajectory, resulting in sparse rewards and imprecise learning signals. To overcome this, the authors propose TurningPoint-GRPO (TP-GRPO), a novel approach that introduces step-level incremental rewards to provide dense learning signals and automatically identifies “turning point” steps—those exerting sustained influence on subsequent states—by dynamically detecting sign changes in reward trends without requiring additional hyperparameters. Integrated within a Flow Matching framework, TP-GRPO jointly optimizes short-term contributions and long-term effects, significantly improving both generation quality and training efficiency, as demonstrated by extensive experiments confirming its effectiveness and stability.

📝 Abstract
Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation quality. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
Problem

Research questions and friction points this paper is trying to address.

Sparse Rewards
Flow-Based GRPO
Step-Wise Effects
Long-Term Dependencies
Text-to-Image Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

TurningPoint-GRPO
step-wise reward
long-term effect
flow-based generative model
sparse reward alleviation
Yunze Tong
Zhejiang University, Hangzhou, China
Mushui Liu
Zhejiang University
Generative Models · Multi-modal Learning · Few-shot Learning
Canyu Zhao
Zhejiang University
Generative Model · Deep Learning
Wanggui He
Researcher, Alibaba Group
Shiyi Zhang
Tsinghua University
Video Generation · Video Understanding
Hongwei Zhang
Zhejiang University, Hangzhou, China
Peng Zhang
Tongyi Lab, Alibaba Group
Computer Vision & Motion & Animation
Jinlong Liu
Alibaba Group, Hangzhou, China
Ju Huang
Alibaba Group, Hangzhou, China
Jiamang Wang
Alibaba Group, Hangzhou, China
Hao Jiang
Alibaba Group
LLM & AIGC
Pipei Huang
Alibaba Group, Hangzhou, China