Dual Caption Preference Optimization for Diffusion Models

📅 2025-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two prevalent challenges in preference optimization for text-to-image diffusion models, namely (1) distributional overlap between preferred and less-preferred samples (the conflict distribution) and (2) degraded noise prediction caused by prompt information that is irrelevant to the less-preferred image (the irrelevant prompt issue), this paper proposes Dual Caption Preference Optimization (DCPO). DCPO assigns each image in a preference pair its own semantically aligned caption, so the preferred and less-preferred images are conditioned on distinct prompts rather than sharing one. To support this, the authors construct the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions per image, and propose three strategies for generating the distinct captions: captioning, perturbation, and hybrid methods. Fine-tuned on Stable Diffusion 2.1, DCPO consistently outperforms SD 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including PickScore, HPSv2.1, GenEval, CLIPScore, and ImageReward, demonstrating improvements in both image quality and prompt fidelity.
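The core mechanical change over Diffusion-DPO is that each side of the preference pair is conditioned on its own caption. A minimal sketch of this objective, assuming the standard DPO sigmoid form over noise-prediction errors and omitting the per-timestep weighting used in practice (the function name `dcpo_loss` and the `beta` value are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcpo_loss(err_theta_w, err_ref_w, err_theta_l, err_ref_l, beta=0.1):
    """Sketch of a dual-caption DPO-style loss.

    Each err_* is a per-sample noise-prediction error ||eps - eps_model(x_t, c, t)||^2.
    The dual-caption idea: the preferred image x^w is conditioned on its own
    caption c^w, and the less-preferred image x^l on a separate caption c^l,
    instead of both sharing one prompt.
    """
    diff_w = err_theta_w - err_ref_w  # model-vs-reference gap on (x^w, c^w)
    diff_l = err_theta_l - err_ref_l  # model-vs-reference gap on (x^l, c^l)
    # Loss shrinks when the model improves on the preferred pair
    # relative to the less-preferred pair.
    return float(-np.log(sigmoid(-beta * (diff_w - diff_l))).mean())
```

As expected for a preference loss, the value is lower when the fine-tuned model beats the reference on the preferred pair and not on the less-preferred one, and higher in the reverse situation.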

📝 Abstract
Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network's ability to accurately predict noise in preference optimization methods, known as the irrelevant prompt issue. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including PickScore, HPSv2.1, GenEval, CLIPScore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
Problem

Research questions and friction points this paper is trying to address.

Mitigate irrelevant prompts in diffusion models
Address conflict distribution in preference datasets
Improve image quality and prompt relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Caption Preference Optimization
Pick-Double Caption dataset
Captioning, perturbation, hybrid methods