Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

📅 2025-11-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GRPO methods for text-to-image generation face two key challenges: (1) inaccurate shared credit assignment—trajectory-level advantages derived from sparse terminal rewards via group normalization are uniformly backpropagated, failing to capture the exploratory potential of early denoising steps; and (2) conflicting multi-objective reward mixing—predefined weighted fusion of heterogeneous rewards (e.g., text alignment, visual quality, color fidelity), differing markedly in scale and variance, induces gradient instability. This paper proposes Multi-GRPO: a novel tree-structured trajectory design enabling temporal grouping for fine-grained advantage estimation at early steps; reward-type-specific independent grouping and normalization to decouple multi-objective optimization conflicts; and MCTS-inspired sampling combined with sparse terminal rewards for stable policy updates. Evaluated on PickScore-25k and OCR-Color-10, Multi-GRPO significantly improves multi-objective alignment accuracy and training stability.

📝 Abstract
Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) Shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) Reward-mixing: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color), which have mismatched scales and variances, lead to unstable gradients and conflicting updates. To address these issues, we propose Multi-GRPO, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally form temporal groups, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce reward-based grouping to compute advantages for each reward function independently before aggregation, disentangling conflicting signals. To facilitate evaluation of multi-objective alignment, we curate OCR-Color-10, a visual text rendering dataset with explicit color constraints. Across the single-reward PickScore-25k and multi-objective OCR-Color-10 benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at https://github.com/fikry102/Multi-GRPO.
Problem

Research questions and friction points this paper is trying to address.

Improves credit assignment in text-to-image generation by addressing uniform advantage estimation across timesteps
Solves reward-mixing issues from combining multi-objective rewards with mismatched scales and variances
Enhances alignment of text-to-image models by balancing conflicting objectives like accuracy and quality
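For context, the baseline the paper critiques can be sketched as follows (my own illustration, not the authors' code): standard GRPO derives a single group-normalized advantage from terminal rewards and applies it uniformly to every denoising timestep, and heterogeneous rewards are fused with fixed weights before normalization, so high-variance terms dominate. The reward values below are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO: z-normalize terminal rewards within a group of trajectories."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Reward-mixing: scores with mismatched scales and variances are fused
# with predefined weights *before* normalization, so the high-variance
# quality term dominates the resulting gradient signal.
text_acc = np.array([0.9, 0.1, 0.8, 0.2])    # hypothetical per-image scores in [0, 1]
quality  = np.array([12.0, 3.0, -5.0, 9.0])  # hypothetical scores on a much larger scale
mixed = 0.5 * text_acc + 0.5 * quality

adv = grpo_advantages(mixed)  # one scalar advantage per trajectory,
                              # shared uniformly across all denoising timesteps
```

The shared-credit problem is visible here: each trajectory gets exactly one advantage value, regardless of how much an early denoising step (with its large exploration space) contributed to the final reward.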
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based trajectories enable temporal grouping for credit assignment
Reward-based grouping independently computes advantages before aggregation
Multi-group framework addresses shared credit and reward-mixing limitations
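The two grouping mechanisms above can be sketched as follows; this is a minimal illustration under my own assumptions, not the authors' implementation. Per-reward advantages are normalized independently before aggregation, and an early branching node is credited with the mean advantage of its descendant leaves (MCTS-style backup). Reward values and the tree shape are hypothetical.

```python
import numpy as np

def per_reward_advantages(reward_matrix, eps=1e-8):
    """Reward-based grouping: normalize each reward column independently,
    then aggregate, so mismatched scales no longer conflict."""
    R = np.asarray(reward_matrix, dtype=float)        # shape (n_leaves, n_rewards)
    A = (R - R.mean(axis=0)) / (R.std(axis=0) + eps)  # per-reward z-scores
    return A.sum(axis=1)                              # aggregate *after* normalization

def temporal_group_advantage(leaf_advantages, leaves_per_branch):
    """Temporal grouping: score an early branching node by the mean
    advantage of its descendant leaves."""
    A = np.asarray(leaf_advantages, dtype=float)
    return A.reshape(-1, leaves_per_branch).mean(axis=1)

# Two branches at an early denoising step, each expanded into two leaves,
# scored by two hypothetical rewards (text accuracy, visual quality).
R = np.array([[0.9, 12.0],   # branch 0, leaf 0
              [0.8,  9.0],   # branch 0, leaf 1
              [0.1,  3.0],   # branch 1, leaf 0
              [0.2, -5.0]])  # branch 1, leaf 1
leaf_adv = per_reward_advantages(R)
branch_adv = temporal_group_advantage(leaf_adv, leaves_per_branch=2)
```

Because each reward column is z-normalized on its own, neither objective drowns out the other; and because branch advantages are backed up from descendant leaves, early denoising steps receive their own credit instead of inheriting a single trajectory-level value.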
Qiang Lyu
University of Chinese Academy of Sciences
Zicong Chen
Beihang University
Chongxiao Wang
Alibaba Group
Haolin Shi
University of Science and Technology of China
Shibo Gao
Beijing Jiaotong University
Ran Piao
Alibaba Group
Youwei Zeng
TikTok Inc.
Jianlou Si
alibaba-inc.com
Fei Ding
Unknown affiliation
Jing Li
Alibaba Group
Chun Pong Lau
Assistant Professor, City University of Hong Kong
Weiqiang Wang
University of Chinese Academy of Sciences