🤖 AI Summary
Existing online reinforcement learning methods for diffusion models struggle to assign fine-grained credit along denoising trajectories and to keep value-based optimization stable when aligning with non-differentiable objectives. This work proposes a state-aligned latent actor-critic framework in which the diffusion model itself serves as a timestep-conditioned value function and predicts values directly on noisy latent states. The approach enables trajectory-level PPO training and lets the trained critic be reused for inference-time steering. It also extends to joint optimization over multiple complementary rewards, which helps alleviate reward hacking. On both UNet- and DiT-based backbones, the method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, and test-time steering provides additional gains in generation quality.
📝 Abstract
Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
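To make the idea of fine-grained credit along a denoising trajectory concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single sparse reward that arrives only once the final sample is produced, and it uses generalized advantage estimation (GAE), a common companion to PPO that the abstract does not name. The function `gae_advantages` and its parameters `gamma` and `lam` are hypothetical, chosen for illustration; `values[k]` stands in for the output of a timestep-conditioned critic on the k-th noisy latent.

```python
from typing import List


def gae_advantages(
    values: List[float],
    final_reward: float,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Per-step advantages for one denoising trajectory (illustrative only).

    values[k] is the critic's estimate at the k-th denoising step, with
    k = 0 the noisiest latent. The reward is assumed to be sparse: it is
    received only after the last step.
    """
    n = len(values)
    advantages = [0.0] * n
    next_value = 0.0  # nothing remains to be earned after the final step
    running = 0.0     # discounted sum of later TD errors
    for k in reversed(range(n)):
        reward = final_reward if k == n - 1 else 0.0
        # One-step TD error: how much better the step turned out than the
        # critic predicted from the current noisy latent.
        delta = reward + gamma * next_value - values[k]
        running = delta + gamma * lam * running
        advantages[k] = running
        next_value = values[k]
    return advantages


if __name__ == "__main__":
    # Hypothetical critic outputs for a three-step trajectory.
    print(gae_advantages([0.2, 0.5, 0.9], final_reward=1.0))
```

With `gamma = 1` and `lam = 1`, each advantage reduces to the final reward minus the critic's value at that step; with `lam = 0`, it reduces to the one-step TD error. Under these assumptions, a critic that evaluates noisy latents directly is what allows each denoising step to receive its own learning signal rather than sharing one trajectory-wide score.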