D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing few-step distilled diffusion models tend to lose their efficient few-step inference capability under continual supervised fine-tuning. To address this, the work proposes an on-policy self-distillation training paradigm in which the same model acts as both teacher and student under different conditioning: the student is conditioned only on text features, while the teacher is conditioned on multimodal features that combine the text and the target image. Training minimizes the discrepancy between the two predicted distributions along the student's own generation trajectory. The approach is presented as the first to bring on-policy learning into the fine-tuning of few-step distilled diffusion models, allowing the model to keep acquiring new concepts and styles while preserving its original efficiency and generation quality.

📝 Abstract

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continual supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that a modern diffusion model in which an LLM/VLM serves as the encoder can inherit its encoder's in-context capabilities. This enables us to cast training as an on-policy self-distillation process. Specifically, during training, the model acts as both the teacher and the student with different contexts: the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc. without sacrificing its original few-step capacity.
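
The training loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear `predict` function stands in for the diffusion backbone, the squared-error discrepancy, the Euler-style update, and the zeroed image slot for the student are all assumptions made for illustration, and every name here is hypothetical. It only shows the shape of the idea: one set of weights, two contexts, and a loss measured at states visited by the student's own roll-out.

```python
import numpy as np

def predict(weights, x_t, t, context):
    """Hypothetical stand-in for the diffusion backbone: a linear map on [x_t, t, context]."""
    features = np.concatenate([x_t, [t], context])
    return weights @ features

def self_distillation_loss(weights, text_feat, image_feat, num_steps=4, seed=0):
    """Average student/teacher discrepancy along the student's own few-step roll-out."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(weights.shape[0])  # start the roll-out from noise

    # Same weights, different contexts (assumption: the student's image slot is zeroed).
    student_ctx = np.concatenate([text_feat, np.zeros_like(image_feat)])
    teacher_ctx = np.concatenate([text_feat, image_feat])

    timesteps = np.linspace(1.0, 0.0, num_steps + 1)
    total = 0.0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v_student = predict(weights, x, t_cur, student_ctx)
        # In a real framework the teacher output would be a fixed (stop-gradient) target.
        v_teacher = predict(weights, x, t_cur, teacher_ctx)
        # Assumed discrepancy: mean squared error between the two predictions.
        total += np.mean((v_student - v_teacher) ** 2)
        # On-policy: the next state comes from the student's prediction, not the teacher's.
        x = x + (t_next - t_cur) * v_student
    return total / num_steps

rng = np.random.default_rng(1)
dim, text_dim, image_dim = 4, 3, 3
weights = 0.1 * rng.standard_normal((dim, dim + 1 + text_dim + image_dim))
text_feat = rng.standard_normal(text_dim)
image_feat = rng.standard_normal(image_dim)
loss = self_distillation_loss(weights, text_feat, image_feat)
```

In this sketch the loss vanishes when the teacher's extra image context carries no information, since both roles then see the same input. A real training step would backpropagate this loss through the student branch only.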

Problem

Research questions and friction points this paper is trying to address.

few-step diffusion models

continuous fine-tuning

inference efficiency

step-distilled models

supervised fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation

step-distilled diffusion models

few-step inference