Explicit Critic Guidance for Aligning Diffusion Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing online RL methods for aligning diffusion models with non-differentiable objectives struggle to assign fine-grained credit along denoising trajectories and to keep value-based optimization stable. This work proposes a state-aligned latent-space actor-critic framework in which the diffusion model itself serves as a timestep-conditioned value function that predicts values directly on noisy latent states. The approach enables trajectory-level PPO training and lets the trained critic be reused for inference-time guidance. It also extends to joint optimization over multiple complementary rewards, which helps mitigate reward hacking. On both UNet and DiT backbones, the method consistently outperforms prior group-relative RL and actor-critic baselines across single- and multi-reward benchmarks, and test-time guidance yields further gains in generation quality.
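The summary above says the trained critic can be reused for inference-time guidance but does not describe the mechanism. As one possible illustration only, the sketch below scores several candidate noisy latents at a given timestep and keeps the one the critic rates highest. The selection rule, the function `pick_by_critic`, and the stand-in `toy_critic` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def pick_by_critic(candidates, critic, t):
    """Return the candidate latent with the highest critic value at timestep t.

    `critic(latent, t)` is a hypothetical callable that returns a scalar value
    estimate for a noisy latent; it stands in for a learned value function.
    """
    scores = np.array([float(critic(x, t)) for x in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

# Toy stand-in critic (an assumption for illustration): prefers latents near 1.0.
def toy_critic(latent, t):
    return -float(np.sum((latent - 1.0) ** 2))

candidates = [np.zeros(4), np.ones(4), 2.0 * np.ones(4)]
chosen, scores = pick_by_critic(candidates, toy_critic, t=10)
```

In a real sampler the candidates would come from several stochastic denoising steps taken from the same latent, and the critic would be the value head described in the summary; the greedy pick shown here is only one of several ways such a critic could steer sampling.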

📝 Abstract

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
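To make the credit-assignment idea concrete, here is a minimal sketch of how per-step advantages could be derived once a critic predicts a value for every noisy latent along a denoising trajectory. It assumes a single terminal reward and uses generalized advantage estimation (GAE), a standard companion to PPO; the abstract does not state which estimator the authors use, so the estimator, the function name `trajectory_advantages`, and the default `gamma` and `lam` values are assumptions.

```python
import numpy as np

def trajectory_advantages(values, final_reward, gamma=1.0, lam=0.95):
    """GAE over one denoising trajectory with a sparse terminal reward.

    values: critic predictions for the noisy latents, in sampling order.
    final_reward: scalar reward assigned to the final generated sample.
    Returns (advantages, returns), where returns are the critic's targets.
    """
    values = np.asarray(values, dtype=np.float64)
    T = len(values)
    rewards = np.zeros(T)
    rewards[-1] = final_reward                 # reward arrives only at the end
    next_values = np.append(values[1:], 0.0)   # value after the last step is 0
    deltas = rewards + gamma * next_values - values  # one-step TD errors
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):               # accumulate TD errors backwards
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages, advantages + values

adv, ret = trajectory_advantages([0.2, 0.5, 0.9], final_reward=1.0, lam=1.0)
```

With `gamma = lam = 1` every return target equals the terminal reward, and each advantage is the gap between that reward and the critic's prediction at that step, which is what lets different denoising steps receive different amounts of credit.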

Problem

Research questions and friction points this paper is trying to address.

diffusion models

reinforcement learning

credit assignment

value-based optimization

reward alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models

actor-critic

reinforcement learning