🤖 AI Summary
Text-to-image (T2I) generation faces inherent trade-offs (e.g., prompt fidelity versus creativity, diversity versus consistency, and ethical compliance versus aesthetic quality) that resist simultaneous optimization with single-objective methods. To address this, we propose YinYangAlign, the first framework to formally define six antagonistic alignment objectives and to construct an axiomatized benchmark dataset annotated with human-elicited explanations of these conflicts. We further introduce multi-objective direct preference optimization (MO-DPO), a paradigm that integrates conflict-aware preference modeling with a human-feedback-driven evaluation system. Extensive evaluation across mainstream T2I models shows that YinYangAlign substantially improves cross-dimensional alignment balance, increasing Pareto-front coverage by 18.7%, while also enhancing both ethical adherence and aesthetic quality.
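The summary names MO-DPO but does not spell out its objective. Below is a minimal sketch of one plausible reading, a convex combination of per-objective DPO losses; the function names, the softmax weighting, and the β hyperparameter are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on one preference pair.

    Arguments are summed log-probabilities of the chosen and rejected
    outputs under the trainable policy and the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

def mo_dpo_loss(per_objective_losses, raw_weights):
    """Hypothetical multi-objective aggregation: a convex combination of
    per-objective DPO losses, one per alignment axis (e.g., fidelity vs.
    creativity). The paper's conflict-aware weighting may differ.
    """
    weights = torch.softmax(raw_weights, dim=0)  # normalize onto a simplex
    return (weights * torch.stack(per_objective_losses)).sum()

# Usage: three alignment axes with uniform (in practice, learnable) weights.
losses = [dpo_loss(torch.tensor(-4.1), torch.tensor(-5.0),
                   torch.tensor(-4.3), torch.tensor(-4.9))
          for _ in range(3)]
total = mo_dpo_loss(losses, torch.ones(3))
```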
📝 Abstract
Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intent but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. Large Language Models (LLMs), by contrast, have achieved notable success in alignment. Building on these advances, researchers are eager to apply similar alignment techniques, such as Direct Preference Optimization (DPO), to T2I systems to enhance image generation fidelity and reliability. We present YinYangAlign, an advanced benchmarking framework that systematically quantifies the alignment fidelity of T2I systems across six fundamental and inherently contradictory design objectives, each expressed as a pair of competing goals. Each pair captures a core tension in image generation, such as balancing adherence to user prompts against creative modification, or maintaining diversity alongside visual coherence. YinYangAlign includes detailed axiom datasets featuring human prompts, aligned (chosen) responses, misaligned (rejected) AI-generated outputs, and explanations of the underlying contradictions.
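To make the dataset structure concrete, a hypothetical record matching the abstract's description is sketched below; the field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical YinYangAlign axiom-dataset record. Field names and values
# are assumptions inferred from the abstract, not the published schema.
example_record = {
    "prompt": "A photorealistic portrait of a medieval knight",
    "objective": "faithfulness_to_prompt_vs_artistic_freedom",
    "chosen": "outputs/knight_photoreal.png",   # aligned: follows the prompt
    "rejected": "outputs/knight_abstract.png",  # misaligned: over-stylized
    "explanation": (
        "The rejected image trades prompt fidelity for creative "
        "stylization, illustrating the tension between the two goals."
    ),
}
```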