Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

📅 2024-06-06
📈 Citations: 12
Influential: 5
🤖 AI Summary
This paper addresses three obstacles to improving aesthetic quality in text-to-image generation: the difficulty of aesthetic enhancement itself, reliance on domain-specific annotations, and the inability of existing methods to model fine-grained visual differences. It proposes Stepwise Preference Optimization (SPO), which operates at each denoising step of a diffusion model: it constructs a candidate image pool via shared-noise sampling and employs a lightweight step-aware preference model to dynamically select win-lose pairs for supervision, discarding the conventional two-trajectory label propagation. This enables fine-grained, stepwise aesthetic modeling without aesthetic-specific annotations. Leveraging only generic image preference data, SPO significantly enhances the aesthetic quality of Stable Diffusion v1.5 and SDXL. Experiments demonstrate that SPO achieves higher aesthetic scores than state-of-the-art DPO methods while preserving text-image alignment and converging faster. The code and models are publicly available.

📝 Abstract
Generating visually appealing images is fundamental to modern text-to-image generation models. A potential solution to better aesthetics is direct preference optimization (DPO), which has been applied to diffusion models to improve general image quality, including prompt alignment and aesthetics. Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. However, preference labels provided in existing datasets blend layout and aesthetic opinions, which may disagree with aesthetic preference. Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps. To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization (SPO), which discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 3) randomly select one from the pool to initialize the next denoising step. This strategy ensures that the diffusion model focuses on subtle, fine-grained visual differences instead of layout aspects. We find that aesthetics can be significantly enhanced by accumulating these improved minor differences. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields significant improvements in aesthetics compared with existing DPO methods while not sacrificing image-text alignment compared with vanilla models. Moreover, SPO converges much faster than DPO methods due to the step-by-step alignment of fine-grained visual details. Code and models are available at https://github.com/RockeyCoss/SPO.
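The three-step procedure in the abstract can be sketched as a simple rollout loop. This is a minimal, hedged illustration, not the paper's implementation: `denoise_step` and `step_preference_score` are hypothetical stand-ins for the diffusion update and the step-aware preference model, and all shapes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, noise):
    # Hypothetical stand-in for one diffusion denoising update.
    return 0.9 * latent + 0.1 * noise

def step_preference_score(latent):
    # Hypothetical step-aware preference model: scores a partially
    # denoised latent (higher = preferred). Real SPO uses a learned model.
    return float(latent.mean())

def spo_rollout(init_latent, num_steps=4, pool_size=3):
    """Sketch of SPO's per-step supervision, following the abstract:
    1) denoise a pool of candidates from a shared noise latent,
    2) pick a win-lose pair with a step-aware preference model,
    3) randomly pick one candidate to initialize the next step."""
    latent = init_latent
    pairs = []
    for _ in range(num_steps):
        shared_noise = rng.standard_normal(latent.shape)
        # 1) candidate pool: all candidates share the same base noise,
        #    so they differ only in subtle, fine-grained details.
        pool = [
            denoise_step(latent, shared_noise + 0.05 * rng.standard_normal(latent.shape))
            for _ in range(pool_size)
        ]
        # 2) step-aware preference model selects the win-lose pair
        scores = [step_preference_score(c) for c in pool]
        win, lose = pool[int(np.argmax(scores))], pool[int(np.argmin(scores))]
        pairs.append((win, lose))  # these pairs would supervise the diffusion model
        # 3) a random candidate seeds the next denoising step
        latent = pool[rng.integers(pool_size)]
    return pairs
```

Sampling from a shared noise latent is the key design choice: it keeps the candidates' layouts nearly identical, so the preference model ranks them on fine-grained visual quality rather than composition.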
Problem

Research questions and friction points this paper is trying to address.

Improving image aesthetics in diffusion models efficiently
Addressing blended preference labels in existing datasets
Enhancing fine-grained visual details without layout compromise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-by-step preference optimization (SPO)
Discards propagation strategy for fine-grained assessment
Uses step-aware preference model for win-lose pairs
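Once a win-lose pair is selected at a step, it can be supervised with a DPO-style objective. The sketch below shows the generic DPO loss form applied to one step-level pair; the variable names and the use of scalar log-probabilities are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stepwise_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO-style loss on one step-level win-lose pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where logp_* come from the model being tuned and ref_logp_* from a
    frozen reference model. A sketch, not the paper's exact objective."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the win candidate is favored
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

With no preference margin the loss is log 2, and it shrinks toward zero as the tuned model favors the winning candidate more strongly than the reference model does.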