SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback

πŸ“… 2026-02-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of aligning diffusion models with human preferences under the constraints of absent reward models and prohibitively expensive large-scale human preference data. To this end, we propose SAIL, a novel framework that reveals, for the first time, the intrinsic self-improvement capability of diffusion models. SAIL employs a closed-loop iterative process in which the model generates its own samples, self-annotates preferences, and refines itself accordingly, augmented by a ranked preference mixup strategy to balance exploration with retention of initial human priors. Experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using only 6% of the preference data required by existing approaches, substantially reducing reliance on costly human annotations.

πŸ“ Abstract
Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. *This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves?* In this paper, we propose **SAIL** (**S**elf-**A**mplified **I**terative **L**earning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
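The closed loop described in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `model_score`, `generate`, `seed_prefs`, and `mix_ratio` are hypothetical names, and the actual refinement step (a DPO-style update on the mixed preference set) is elided as a comment.

```python
import random

def sail_loop(model_score, seed_prefs, generate, rounds=3, mix_ratio=0.5):
    """Hedged sketch of SAIL's closed loop: generate samples, self-annotate
    preference pairs with the model's own score, then mix the ranked
    self-annotated pairs with the human seed set before each refinement.

    model_score: callable scoring a sample under the current model.
    seed_prefs:  minimal set of human-annotated (winner, loser) pairs.
    generate:    callable producing a fresh batch of model samples.
    mix_ratio:   fraction of seed pairs retained, guarding human priors.
    """
    prefs = list(seed_prefs)
    for _ in range(rounds):
        samples = generate()  # model generates diverse candidates
        ranked = sorted(samples, key=model_score, reverse=True)
        # self-annotation: adjacent ranked samples become (winner, loser) pairs
        self_pairs = list(zip(ranked, ranked[1:]))
        # ranked preference mixup: retain human seed pairs alongside
        # self-annotated ones to prevent catastrophic forgetting
        k = min(int(mix_ratio * len(self_pairs)), len(seed_prefs))
        prefs = random.sample(seed_prefs, k) + self_pairs
        # refine(model, prefs) would run a preference-optimization update here
    return prefs
```

The key design point the abstract emphasizes is the mixup: without folding the human seed pairs back in each round, a pure self-annotation loop can drift away from the initial human priors.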
Problem

Research questions and friction points this paper is trying to address.

diffusion model alignment
minimal human feedback
reward-free learning
preference learning
human-AI alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
preference alignment
self-amplified learning
minimal human feedback
iterative self-improvement
πŸ”Ž Similar Papers
No similar papers found.