TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

📅 2026-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing few-step diffusion-based reinforcement learning methods, which rely on differentiable rewards and thus struggle to incorporate non-differentiable real-world signals such as human preferences or object counts. To overcome this, the authors propose TDM-R1, a novel few-step diffusion reinforcement learning paradigm that decouples surrogate reward learning from generator training and constructs stepwise rewards along deterministic generation trajectories, thereby enabling the first general-purpose support for non-differentiable rewards. Built upon Trajectory Distribution Matching (TDM), TDM-R1 unifies surrogate reward modeling with policy gradient optimization into a cohesive post-training framework. Experiments demonstrate that TDM-R1 achieves state-of-the-art performance on tasks involving text rendering, visual quality, and preference alignment, surpassing both the 100-step and few-step variants of Z-Image using only four inference steps.

📝 Abstract
While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an unsolved problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, thereby excluding most important real-world reward signals, such as non-differentiable rewards like humans' binary preferences or object counts. To properly incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability to learn from generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
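The decoupled recipe the abstract describes — fit a differentiable surrogate to a non-differentiable reward, collect stepwise signals along a deterministic few-step trajectory, then update the generator through the surrogate — can be sketched as a toy. This is a minimal illustration under strong assumptions, not the actual TDM-R1 implementation: the "generator" is a linear few-step trajectory, the non-differentiable reward is a simple coordinate count (standing in for signals like object counts or binary preferences), and the surrogate is a least-squares linear model. All names, dimensions, and the linear dynamics are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS = 8, 4

def true_reward(x):
    # Non-differentiable reward: count of coordinates above zero
    # (a stand-in for signals like object counts or binary preferences).
    return float((x > 0).sum())

def trajectory(z, theta):
    """Toy deterministic few-step 'generator': each step nudges z by theta/STEPS."""
    states, x = [], z
    for _ in range(STEPS):
        x = x + theta / STEPS
        states.append(x)
    return states

theta = np.full(DIM, -1.0)  # generator parameters, initialized in a low-reward region

for _ in range(200):
    Z = 0.5 * rng.standard_normal((64, DIM))  # batch of initial noises
    grad = np.zeros(DIM)
    for k, Xk in enumerate(trajectory(Z, theta)):
        # --- Surrogate reward learning (decoupled from the generator) ---
        # Fit a linear surrogate r_hat(x) = w @ x + b to the observed
        # non-differentiable rewards at this trajectory step.
        rewards = np.array([true_reward(x) for x in Xk])
        A = np.hstack([Xk, np.ones((len(Xk), 1))])
        coef, *_ = np.linalg.lstsq(A, rewards, rcond=None)
        w = coef[:DIM]
        # Stepwise reward signal: for this toy trajectory, d x_k / d theta
        # is (k + 1) / STEPS, so chain it through the surrogate gradient w.
        grad += w * (k + 1) / STEPS
    # --- Generator learning: ascend the summed per-step surrogate rewards ---
    theta = theta + 0.2 * grad / STEPS

# theta is pushed toward the high-reward region (all coordinates positive),
# even though the true reward itself is non-differentiable.
```

The key structural point the sketch preserves is the decoupling: the non-differentiable `true_reward` is only ever queried, never differentiated, while the generator update flows entirely through the fitted surrogate at each trajectory step.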
Problem

Research questions and friction points this paper is trying to address.

few-step diffusion models
non-differentiable reward
reinforcement learning
generative models
reward alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

few-step diffusion models
non-differentiable rewards
reinforcement learning
Trajectory Distribution Matching
reward decoupling