🤖 AI Summary
Text-to-image (T2I) diffusion models suffer from poor generalization and inconsistent reward signals when aligned with multiple human preferences. Method: This paper proposes a fully automated, annotation-free framework for multi-reward co-optimization. It introduces a reward calibration mechanism to harmonize heterogeneous preference scales, combined with Pareto-frontier-guided paired sampling and regression-based preference optimization to maintain consistency and resolve conflicts across rewards. The pipeline comprises multi-reward ensemble integration, calibration-driven preference learning, and end-to-end optimization targeting generative quality. Results: Experiments show the method consistently outperforms baselines, including DPO, on GenEval and T2I-CompBench, improving both image fidelity and alignment with human preferences in single- and multi-reward settings. To the authors' knowledge, this is the first approach enabling robust, scalable T2I alignment driven by heterogeneous, multi-source reward signals.
📝 Abstract
Aligning text-to-image (T2I) diffusion models with preference optimization is valuable when human-annotated datasets are available, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative; however, current preference optimization methods fall short of exploiting their rich information, as they only consider pairwise preference distributions. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human-annotated data. The core of our approach is a reward calibration method that approximates the general preference by computing the expected win-rate against samples generated by the pretrained model. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use a regression loss to fine-tune diffusion models to match the difference between the calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods such as Direct Preference Optimization (DPO) in both single- and multi-reward settings, as validated on T2I benchmarks including GenEval and T2I-CompBench.
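The two core ingredients described above, calibrating raw reward scores as expected win-rates against the pretrained model's own samples, and selecting preference pairs via Pareto dominance over multiple calibrated rewards, can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation; the function names `calibrate_rewards` and `pareto_frontier` and the dominance convention (maximize all rewards, strict `>` for win counting) are assumptions.

```python
import numpy as np

def calibrate_rewards(rewards, ref_rewards):
    """Calibrate raw reward scores as expected win-rates against
    reference samples from the pretrained model (illustrative sketch).

    rewards:     (N,) raw scores for N candidate images
    ref_rewards: (M,) raw scores for M reference images
    Returns, for each candidate, the fraction of reference samples it
    beats, mapping heterogeneous reward scales onto a common [0, 1] scale.
    """
    rewards = np.asarray(rewards, dtype=float)
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    # broadcast to an (N, M) win matrix, then average over references
    return (rewards[:, None] > ref_rewards[None, :]).mean(axis=1)

def pareto_frontier(points):
    """Indices of the non-dominated points (maximization convention).

    points: (N, K) matrix of calibrated win-rates from K reward models.
    A point is dominated if some other point is >= in every reward and
    strictly > in at least one; frontier members can serve as the
    "winner" side of a preference pair, dominated points as the "loser".
    """
    points = np.asarray(points, dtype=float)
    frontier = []
    for i, p in enumerate(points):
        dominated = np.any(
            np.all(points >= p, axis=1) & np.any(points > p, axis=1))
        if not dominated:
            frontier.append(i)
    return frontier
```

A regression-style objective would then fine-tune the diffusion model so that its implicit preference margin for a (frontier, dominated) pair matches the difference of their calibrated win-rates, rather than a hard binary label as in DPO.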