🤖 AI Summary
Preference-based fine-tuning of text-to-image diffusion models suffers from low-quality preferences, sparse and uninformative feedback signals, poor interpretability, reward hacking, and overfitting—largely due to reliance on opaque, black-box reward models.
Method: We propose a preference optimization framework grounded in multidimensional, human-interpretable feedback. Leveraging large language models, we automatically generate critical image critiques and convert them into executable editing instructions (e.g., for ControlNet or inpainting), thereby synthesizing high-information, fine-grained preference pairs aligned with human intent. Crucially, we bypass explicit reward modeling and instead integrate a DPO variant directly into the diffusion model for preference learning.
Results: Evaluated on SDXL and other mainstream diffusion models, our approach significantly improves image fidelity, text-image alignment, and controllability. Preference pair efficacy increases by 42% over Diffusion-DPO, demonstrating superior generalization and interpretability.
📝 Abstract
We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.