Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Addressing the challenge of simultaneously achieving high image quality, inference efficiency, and alignment with human preferences in text-to-image generation, this paper introduces Diff-Instruct*, the first online reinforcement learning–based one-step generation framework that requires no image-level supervision. By replacing the KL divergence with a score-based divergence regularization, the authors derive an equivalent, efficiently optimizable gradient objective, enabling joint optimization of fidelity and human preference within a single diffusion step. The method integrates RLHF, score regularization, and knowledge distillation from Stable Diffusion-XL to construct DI*-SDXL-1step (2.6B parameters). At 1024×1024 resolution, it achieves an inference latency of only 1.88% that of FLUX-dev (50-step) and reduces GPU memory consumption to 29.3%, while outperforming prior state-of-the-art methods across four human preference benchmarks, including PickScore.

📝 Abstract
In this paper, we introduce Diff-Instruct* (DI*), an image-data-free approach for building one-step text-to-image generative models that align with human preference while maintaining the ability to generate highly realistic images. We frame human preference alignment as online reinforcement learning from human feedback (RLHF), where the goal is to maximize the reward function while regularizing the generator distribution to remain close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization, which leads to significantly better performance. Although direct calculation of this preference alignment objective remains intractable, we demonstrate that its gradient can be computed efficiently by deriving an equivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to train a Stable Diffusion-XL-based one-step model, the 2.6B DI*-SDXL-1step text-to-image model, which generates images at 1024×1024 resolution in a single generation step. The DI*-SDXL-1step model uses only 1.88% of the inference time and 29.30% of the GPU memory cost of the 12B FLUX-dev-50step model while significantly outperforming it on PickScore, ImageReward, and CLIPScore on the Parti prompt benchmark and on HPSv2.1 on the Human Preference Score benchmark, establishing a new state of the art for human-preferred one-step text-to-image generative models. Beyond the strong quantitative performance, extensive qualitative comparisons also confirm the advantages of DI* in maintaining diversity, improving image layouts, and enhancing aesthetic colors. We have released our industry-ready model on the homepage: https://github.com/pkulwj1994/diff_instruct_star
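The alignment objective described in the abstract follows the general shape of regularized RLHF: maximize expected reward while penalizing divergence from a reference distribution. A minimal sketch of that generic form is below; the symbols $r$, $\beta$, and $D_{\mathrm{score}}$ are illustrative placeholders, and the paper's exact score-based divergence is defined in the paper itself, not here.

```latex
% Generic regularized RLHF objective for a one-step generator p_\theta,
% conditioned on prompt c, with reference diffusion distribution p_ref.
% Traditional RLHF uses D = KL; Diff-Instruct* replaces it with a
% score-based divergence (D_score below is a stand-in symbol).
\max_{\theta}\;
  \mathbb{E}_{x \sim p_\theta(\cdot \mid c)}\bigl[\, r(x, c) \,\bigr]
  \;-\; \beta \, D_{\mathrm{score}}\!\bigl( p_\theta \,\Vert\, p_{\mathrm{ref}} \bigr)
```

The abstract notes that this objective is intractable to evaluate directly, but that an equivalent, tractable loss with the same gradient can be derived and optimized instead.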
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Generation
Efficiency
Resource Consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diff-Instruct*
reinforcement learning
computational efficiency