Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation

📅 2025-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address ambiguous user prompts in image generation—leading to iterative revisions and high interaction costs for non-expert users—this paper proposes the Vision-Cooperative Adaptive (VCA) framework, which refines intent through human-AI multi-turn dialogue, iteratively optimizes prompts, and progressively enhances image quality. Key contributions include: (1) the first human-AI cooperative adaptive mechanism for prompt disambiguation; (2) the construction of the first multi-turn prompt–image dialogue dataset with fine-grained intent annotations; and (3) a unified optimization pipeline integrating Retrieval-Augmented Generation (RAG), CLIP-based cross-modal semantic scoring, and Proximal Policy Optimization (PPO) reinforcement learning to jointly optimize semantic disambiguation and pixel-level fidelity. Experiments demonstrate that VCA reduces average dialogue turns to 4.3, achieves a CLIP similarity score of 0.92, and attains a user satisfaction rating of 4.73/5—significantly outperforming DALL·E 3 and Stable Diffusion.

Technology Category

Application Category

📝 Abstract
Modern image generation systems can produce high-quality visuals, yet user prompts often contain ambiguities, requiring multiple revisions. Existing methods struggle to address the nuanced needs of non-expert users. We propose Visual Co-Adaptation (VCA), a novel framework that iteratively refines prompts and aligns generated images with user preferences. VCA employs a fine-tuned language model with reinforcement learning and multi-turn dialogues for prompt disambiguation. Key components include the Incremental Context-Enhanced Dialogue Block for interactive clarification, the Semantic Exploration and Disambiguation Module (SESD) leveraging Retrieval-Augmented Generation (RAG) and CLIP scoring, and the Pixel Precision and Consistency Optimization Module (PPCO) for refining image details using Proximal Policy Optimization (PPO). A human-in-the-loop feedback mechanism further improves performance. Experiments show that VCA surpasses models like DALL-E 3 and Stable Diffusion, reducing dialogue rounds to 4.3, achieving a CLIP score of 0.92, and enhancing user satisfaction to 4.73/5. Additionally, we introduce a novel multi-round dialogue dataset with prompt-image pairs and user intent annotations.
Problem

Research questions and friction points this paper is trying to address.

Image Generation
User Intent Understanding
Dialogue Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Collaborative Adaptation (VCA)
Reinforcement Learning for Image Generation
User Feedback Mechanism
🔎 Similar Papers
No similar papers found.
Yangfan He
Yangfan He
University of Minnesota - Twin Cities
AI AgentReasoningAI AlignmentFoundation Models
Jianhui Wang
Jianhui Wang
University of Electronic Science and Technology of China, Qingshuihe Campus, 2006 Xiyuan Ave, West Hi-Tech Zone, Chengdu, Sichuan 611731, China
K
Kun Li
Xiamen University, 422 Siming South Road, Xiamen, Fujian 361005, China
Yijin Wang
Yijin Wang
undergraduate,Xidian University
machine learning
L
Li Sun
Boston University, Boston, Massachusetts 02215, USA
J
Jun Yin
Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong 518055, China
M
Miao Zhang
Shenzhen International Graduate School, Tsinghua University, University Town of Shenzhen, Nanshan District, Shenzhen, Guangdong 518055, China
Xueqian Wang
Xueqian Wang
Tsinghua University
Information FusionTarget DetectionRadar ImagingImage Processing