TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

๐Ÿ“… 2025-12-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Diffusion models face significant computational bottlenecks in reinforcement learning (RL) post-training, hindering widespread adoption. To address this, the authors propose TreeGRPO, an RL framework that models the denoising process as a search tree, generating multiple candidate trajectories in parallel from shared noise prefixes and thereby supporting multiple policy updates per forward pass. A tree-structured advantage mechanism provides step-level, fine-grained reward attribution and advantage estimation. Built on the GRPO framework, TreeGRPO integrates tree-based trajectory sampling, prefix reuse, multi-branch expansion, and efficient gradient backpropagation through the tree. Extensive experiments across multiple benchmarks and reward models show that TreeGRPO consistently outperforms baseline methods, achieving a 2.4× training speedup while attaining superior trade-offs among sample efficiency, training throughput, and generation quality.
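The tree-structured sampling described above can be illustrated with a minimal sketch. The `Node`, `denoise_step`, and `expand` names, the toy scalar latent, and the choice of branch steps are all illustrative assumptions, not the paper's implementation; a real model would replace `denoise_step` with a network-driven denoising update.

```python
import random

class Node:
    """A node in the denoising tree: one latent state at a given timestep."""
    def __init__(self, state, step, parent=None):
        self.state = state        # latent; a single float here for illustration
        self.step = step          # denoising timestep index
        self.parent = parent
        self.children = []

def denoise_step(state, step):
    # Placeholder for one denoising step; a real model calls the network here.
    return state + random.gauss(0.0, 1.0)

def expand(node, total_steps, branch_steps, n_children):
    """Grow the tree: branch into n_children at designated steps, else continue
    with one child. The shared prefix (root-to-node path) is computed only once
    and reused by every trajectory that passes through it."""
    if node.step == total_steps:
        return
    k = n_children if node.step in branch_steps else 1
    for _ in range(k):
        child = Node(denoise_step(node.state, node.step), node.step + 1, parent=node)
        node.children.append(child)
        expand(child, total_steps, branch_steps, n_children)

def leaves(node):
    """Collect the completed trajectories (leaf nodes) of the tree."""
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]

random.seed(0)
root = Node(state=0.0, step=0)
expand(root, total_steps=4, branch_steps={1, 2}, n_children=2)
print(len(leaves(root)))  # 4 trajectories, all sharing the step 0→1 prefix
```

Branching at two of the four steps yields four candidate trajectories from a single shared prefix, which is the source of the amortized computation the summary describes.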

๐Ÿ“ Abstract
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance under the same training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
Problem

Research questions and friction points this paper is trying to address.

Improves RL post-training efficiency for diffusion models
Enables fine-grained credit assignment in reward propagation
Reduces computational cost while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates denoising as a search tree for efficiency
Uses reward backpropagation for fine-grained credit assignment
Enables multiple policy updates per forward pass
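The reward-backpropagation idea in the bullets above can be sketched as follows. This is a self-contained toy, not the paper's code: the `Node` class, the mean-of-children value rule, and the sibling-mean baseline are assumptions chosen to illustrate how leaf rewards can be turned into step-level, group-relative advantages in the spirit of GRPO.

```python
class Node:
    """Minimal denoising-tree node for advantage computation."""
    def __init__(self, children=None, reward=None):
        self.children = children or []
        self.reward = reward      # leaf reward from the reward model
        self.value = None
        self.advantage = None

def backprop_values(node):
    """Propagate rewards up the tree: a leaf's value is its reward, an
    internal node's value is the mean of its children's values."""
    if not node.children:
        node.value = node.reward
    else:
        node.value = sum(backprop_values(c) for c in node.children) / len(node.children)
    return node.value

def tree_advantages(node):
    """At each branch point, a child's advantage is its value minus the
    sibling mean: a GRPO-style group baseline applied per denoising step
    rather than uniformly over whole trajectories."""
    if len(node.children) > 1:
        mean = sum(c.value for c in node.children) / len(node.children)
        for c in node.children:
            c.advantage = c.value - mean
    for c in node.children:
        tree_advantages(c)

# A tiny tree: one branch point whose two subtrees each hold two leaves.
left = Node(children=[Node(reward=1.0), Node(reward=0.0)])   # value 0.5
right = Node(children=[Node(reward=0.8), Node(reward=0.6)])  # value 0.7
root = Node(children=[left, right])

backprop_values(root)
tree_advantages(root)
print(left.advantage, right.advantage)  # approximately -0.1 and +0.1
```

Because advantages are computed wherever siblings diverge, each branching step receives its own credit signal, rather than every step inheriting one trajectory-level advantage.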
๐Ÿ”Ž Similar Papers
No similar papers found.