🤖 AI Summary
Diffusion models struggle to capture users’ fine-grained preferences in multi-turn human-AI visual dialogues. Method: This paper proposes the Visual Co-Adaptation (VCA) framework, the first to jointly optimize generative diversity, cross-turn consistency, and alignment with human preferences in multi-turn text–image dialogues. We construct a high-quality multi-turn text–image dialogue dataset; integrate human-in-the-loop feedback, a pre-trained reward model, and a multi-objective reward function; and employ LoRA for efficient fine-tuning. Contributions/Results: Experiments demonstrate that VCA significantly improves image–intent alignment and inter-turn generation stability, outperforming state-of-the-art methods on both preference alignment and user satisfaction metrics for controllable, preference-aware multimodal dialogue generation.
📝 Abstract
Generative AI has transformed many industries by enabling text-driven image generation, yet challenges remain in producing high-resolution outputs that align with fine-grained user preferences. Consequently, multiple rounds of interaction are often needed before the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over multi-round dialogue datasets. In this work, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback and leverages a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions covering diversity, consistency, and preference feedback while fine-tuning the diffusion model through LoRA, thereby optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompt and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent, and consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
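The abstract describes combining several reward signals (diversity, consistency, and human preference feedback) into a single objective that drives LoRA fine-tuning of the diffusion model. A minimal sketch of that aggregation step is below; the function name, the per-turn reward values, and the weights are illustrative assumptions, not the paper's exact formulation.

```python
def combined_reward(r_diversity: float, r_consistency: float, r_preference: float,
                    w_div: float = 0.2, w_con: float = 0.3, w_pref: float = 0.5) -> float:
    """Weighted sum of per-turn reward terms.

    The three terms mirror the reward signals named in the abstract
    (generative diversity, cross-turn consistency, preference feedback);
    the weights here are placeholder assumptions for illustration.
    """
    return w_div * r_diversity + w_con * r_consistency + w_pref * r_preference


# Example: a dialogue turn with strong preference alignment but modest diversity.
score = combined_reward(r_diversity=0.4, r_consistency=0.8, r_preference=0.9)
print(round(score, 2))  # 0.2*0.4 + 0.3*0.8 + 0.5*0.9 = 0.77
```

In a training loop, a scalar like `score` would typically scale the policy-gradient or reward-weighted loss used to update only the LoRA adapter parameters, leaving the base diffusion weights frozen.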