DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cross-task interference, task imbalance, and catastrophic forgetting in multi-task reinforcement learning for diffusion-based text-to-image models. The authors propose DiffusionOPD, which lifts Online Policy Distillation (OPD) from discrete tokens to continuous-state Markov processes. It decouples single-task exploration from multi-task integration: task-specific teacher policies are first trained independently, and their capabilities are then distilled into a unified student policy along the student's own rollout trajectories. The method derives a closed-form per-step KL objective that covers both stochastic (SDE) and deterministic (ODE) refinement via mean-matching, and the authors argue that the resulting analytic gradient has lower variance and better generality than PPO-style policy gradients. Experiments show that DiffusionOPD surpasses multi-reward RL and cascade RL baselines in both training efficiency and final performance, with state-of-the-art results reported on all evaluated benchmarks.

📝 Abstract

Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
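
To give a feel for why a per-step KL can reduce to mean-matching, the sketch below uses a standard fact about Gaussians: when two Gaussian transitions share the same isotropic standard deviation, their KL divergence is the squared distance between their means divided by twice the variance. This is only an illustration under that assumption, not the paper's objective. The function names (`per_step_kl`, `monte_carlo_kl`), the shared `sigma`, and the toy means are all hypothetical. The sampled estimator is included only to show that the analytic value needs no sampling, which loosely mirrors the lower-variance argument in the abstract.

```python
import random


def per_step_kl(student_mean, teacher_mean, sigma):
    """Closed-form KL( N(student_mean, sigma^2 I) || N(teacher_mean, sigma^2 I) ).

    Assumption for illustration: both transitions are isotropic Gaussians
    with the same standard deviation, so the KL reduces to mean-matching.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    sq_dist = sum((s - t) ** 2 for s, t in zip(student_mean, teacher_mean))
    return sq_dist / (2.0 * sigma ** 2)


def monte_carlo_kl(student_mean, teacher_mean, sigma, n_samples=20000, seed=0):
    """Sampled estimate of the same KL, averaging log p(x) - log q(x) over x ~ p.

    The Gaussian normalizing constants cancel because sigma is shared.
    """
    rng = random.Random(seed)
    two_var = 2.0 * sigma ** 2
    total = 0.0
    for _ in range(n_samples):
        # Draw a sample from the student transition.
        x = [rng.gauss(m, sigma) for m in student_mean]
        log_p = -sum((xi - s) ** 2 for xi, s in zip(x, student_mean)) / two_var
        log_q = -sum((xi - t) ** 2 for xi, t in zip(x, teacher_mean)) / two_var
        total += log_p - log_q
    return total / n_samples


# Toy values, chosen only for illustration.
student = [0.0, 0.0]
teacher = [1.0, 2.0]
analytic = per_step_kl(student, teacher, sigma=0.5)
sampled = monte_carlo_kl(student, teacher, sigma=0.5)
```

With these toy values the analytic KL is (1 + 4) / (2 × 0.25) = 10, and the sampled estimate fluctuates around that value from seed to seed. The closed form is exact and depends only on the two means.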

Problem

Research questions and friction points this paper is trying to address.

multi-task reinforcement learning

diffusion models

cross-task interference

catastrophic forgetting

text-to-image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Policy Distillation

Diffusion Models

Multi-task Reinforcement Learning