🤖 AI Summary
This work addresses the challenge of aligning few-step text-to-image generation with human preferences, a task often hindered by the sparse and imprecise reward signals of existing reinforcement learning approaches. To overcome this limitation, the authors propose the TAFS-GRPO framework, which introduces a novel step-aware advantage mechanism that delivers dense, step-specific supervision without requiring a differentiable reward function. The framework combines temperature-annealed few-step sampling (TAFS), Group Relative Policy Optimization (GRPO), and adaptive temporal noise injection to balance semantic fidelity with controlled stochasticity. Experimental results demonstrate that TAFS-GRPO significantly outperforms current state-of-the-art methods in few-step generation, achieving better alignment with human preferences, higher image quality, and improved training stability.
📝 Abstract
Recent advances in flow matching models, particularly when combined with reinforcement learning (RL), have significantly improved human-preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on many denoising steps and suffer from sparse, imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature-Annealed Few-step Sampling with Group Relative Policy Optimization (TAFS-GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators that are well aligned with human preferences. Our method iteratively injects adaptive temporal noise into the results of one-step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism builds on GRPO to avoid requiring a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS-GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be released to facilitate further research.
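The two ingredients the abstract describes — annealed noise injection on one-step samples, and GRPO-style advantages that sidestep reward-function differentiability — can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the linear annealing schedule, the function names, and the constants are hypothetical, not the authors' implementation.

```python
import numpy as np

def annealed_noise_scale(step, num_steps, t_start=1.0, t_end=0.1):
    """Hypothetical temperature-annealing schedule: the noise scale
    decays linearly from t_start to t_end across refinement steps."""
    frac = step / max(num_steps - 1, 1)
    return t_start + frac * (t_end - t_start)

def inject_temporal_noise(sample, step, num_steps, rng):
    """Perturb a one-step sample with annealed Gaussian noise: early
    steps add more stochasticity for exploration, later steps add
    less, preserving the sample's semantic content."""
    sigma = annealed_noise_scale(step, num_steps)
    return sample + sigma * rng.standard_normal(sample.shape)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's scalar reward
    against the group mean and std, so only reward *values* are
    needed — no gradient through the reward model."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In this sketch, a group of images sampled for the same prompt is scored by any black-box reward model, and `group_relative_advantages` converts those scores into the per-sample advantages used to weight the policy-gradient update; the step-aware variant in the paper would additionally assign such advantages per denoising step.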