🤖 AI Summary
Diffusion models struggle to capture users’ fine-grained preferences in multi-turn human-AI visual dialogues. Method: This paper proposes the Visual Co-Adaptation (VCA) framework, the first to jointly optimize generative diversity, cross-turn consistency, and alignment with human preferences in multi-turn text–image dialogues. We construct a high-quality multi-turn text–image dialogue dataset; integrate human-in-the-loop feedback, a pre-trained reward model, and a multi-objective reward function; and employ LoRA for efficient fine-tuning. Contributions/Results: Experiments demonstrate that VCA significantly improves image–intent alignment and inter-turn generation stability, outperforming state-of-the-art methods on both preference alignment and user satisfaction metrics for controllable, preference-aware multimodal dialogue generation.
📝 Abstract
Generative AI has transformed many industries by enabling text-driven image generation, yet challenges remain in producing high-resolution outputs that align with fine-grained user preferences. Consequently, multiple rounds of interaction are often needed before the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over multi-round dialogue datasets. In this work, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback and leverages a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions covering diversity, consistency, and preference feedback while fine-tuning the diffusion model through LoRA, thereby optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompt and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent, and consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
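The abstract describes combining several reward signals (diversity, consistency, and human preference feedback) into a single objective that drives LoRA fine-tuning of the diffusion model. A minimal sketch of that aggregation step is below; the function name, the per-turn reward values, and the weights are illustrative assumptions, not the paper's exact formulation.

```python
def combined_reward(r_diversity: float, r_consistency: float, r_preference: float,
                    w_div: float = 0.2, w_con: float = 0.3, w_pref: float = 0.5) -> float:
    """Weighted sum of per-turn reward terms.

    The three terms mirror the reward signals named in the abstract
    (generative diversity, cross-turn consistency, preference feedback);
    the weights here are placeholder assumptions for illustration.
    """
    return w_div * r_diversity + w_con * r_consistency + w_pref * r_preference


# Example: a dialogue turn with strong preference alignment but modest diversity.
score = combined_reward(r_diversity=0.4, r_consistency=0.8, r_preference=0.9)
print(round(score, 2))  # 0.2*0.4 + 0.3*0.8 + 0.5*0.9 = 0.77
```

In a training loop, a scalar like `score` would typically scale the policy-gradient or reward-weighted loss used to update only the LoRA adapter parameters, leaving the base diffusion weights frozen.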