🤖 AI Summary
Text–image alignment remains a critical challenge for diffusion models, and existing RLHF-based approaches rely heavily on costly human-provided image preference annotations, severely limiting scalability. To address this, we propose Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment without paired image preference labels. TPO leverages large language models to automatically generate semantically mismatched perturbations of the original captions, thereby constructing text-level contrastive signals (a sketch of this step follows below). We further extend DPO and KTO into Text-DPO (TDPO) and Text-KTO (TKTO), tailored to text-guided generation. Crucially, TPO eliminates dependence on manual image annotations while retaining strong generalizability and scalability. Extensive experiments across multiple benchmarks demonstrate significant improvements in both quantitative text–image alignment accuracy and human preference scores, validating TPO's effectiveness and practical utility.
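As a concrete illustration of the caption-perturbation step, the snippet below shows how an LLM could be prompted to produce a semantically mismatched caption. The prompt template and the `call_llm` helper are hypothetical stand-ins for whatever LLM interface is used; this is a minimal sketch, not the paper's exact pipeline.

```python
# Hypothetical sketch: an LLM rewrites a caption into a semantically
# mismatched variant that serves as the "dispreferred" text condition.
PERTURB_TEMPLATE = (
    "Rewrite the caption so it describes a subtly different scene "
    "(change an object, attribute, or relation), keeping the style:\n"
    "Caption: {caption}\nPerturbed caption:"
)

def make_mismatched_caption(caption: str, call_llm) -> str:
    """Return a semantically mismatched version of `caption`.

    `call_llm` is any callable mapping a prompt string to a completion
    (e.g., a thin wrapper around your preferred LLM API).
    """
    return call_llm(PERTURB_TEMPLATE.format(caption=caption)).strip()

# Example: "a red bicycle leaning against a brick wall" might become
# "a blue bicycle leaning against a wooden fence".
```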
📝 Abstract
Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
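To make the text-preference objective concrete, here is a minimal PyTorch sketch of a TDPO-style loss, following the Diffusion-DPO pattern but with matched vs. LLM-perturbed prompt embeddings playing the roles of the preferred and dispreferred conditions. The function names, denoiser signature, and exact loss form are illustrative assumptions, not the repository's verified implementation.

```python
import torch
import torch.nn.functional as F

def tdpo_loss(model, ref_model, x_t, t, noise,
              emb_matched, emb_mismatched, beta=0.1):
    """DPO-style loss that prefers denoising under the matched prompt.

    Illustrative sketch: the exact objective in the paper may differ.
    """
    def denoise_err(net, emb):
        # Per-sample squared denoising error under one text condition.
        pred = net(x_t, t, emb)
        return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))

    err_w = denoise_err(model, emb_matched)      # policy, matched caption
    err_l = denoise_err(model, emb_mismatched)   # policy, perturbed caption
    with torch.no_grad():                        # frozen reference model
        ref_w = denoise_err(ref_model, emb_matched)
        ref_l = denoise_err(ref_model, emb_mismatched)

    # Implicit reward margin: how much more strongly the policy (relative to
    # the reference) prefers the matched condition over the mismatched one.
    margin = (err_l - ref_l) - (err_w - ref_w)
    return -F.logsigmoid(beta * margin).mean()


# Toy usage; in practice `model`/`ref_model` would be a conditional UNet and
# the embeddings would come from the model's text encoder.
class ToyDenoiser(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 4 * 8 * 8)

    def forward(self, x_t, t, emb):
        return x_t + self.proj(emb).view(-1, 4, 8, 8)

model, ref_model = ToyDenoiser(), ToyDenoiser()
x_t = torch.randn(2, 4, 8, 8)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(x_t)
emb_matched, emb_mismatched = torch.randn(2, 8), torch.randn(2, 8)
print(tdpo_loss(model, ref_model, x_t, t, noise, emb_matched, emb_mismatched))
```

The key design point this sketch illustrates is that the preference pair lives entirely on the text side: both terms are computed from the same image latent and the same noise target, so no second generated image and no human preference label are required.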