Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation

📅 2025-01-25

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

To address ambiguous user prompts in image generation—leading to iterative revisions and high interaction costs for non-expert users—this paper proposes the Vision-Cooperative Adaptive (VCA) framework, which refines intent through human-AI multi-turn dialogue, iteratively optimizes prompts, and progressively enhances image quality. Key contributions include: (1) the first human-AI cooperative adaptive mechanism for prompt disambiguation; (2) the construction of the first multi-turn prompt–image dialogue dataset with fine-grained intent annotations; and (3) a unified optimization pipeline integrating Retrieval-Augmented Generation (RAG), CLIP-based cross-modal semantic scoring, and Proximal Policy Optimization (PPO) reinforcement learning to jointly optimize semantic disambiguation and pixel-level fidelity. Experiments demonstrate that VCA reduces average dialogue turns to 4.3, achieves a CLIP similarity score of 0.92, and attains a user satisfaction rating of 4.73/5—significantly outperforming DALL·E 3 and Stable Diffusion.

Technology Category

Application Category

📝 Abstract

Modern image generation systems can produce high-quality visuals, yet user prompts often contain ambiguities, requiring multiple revisions. Existing methods struggle to address the nuanced needs of non-expert users. We propose Visual Co-Adaptation (VCA), a novel framework that iteratively refines prompts and aligns generated images with user preferences. VCA employs a fine-tuned language model with reinforcement learning and multi-turn dialogues for prompt disambiguation. Key components include the Incremental Context-Enhanced Dialogue Block for interactive clarification, the Semantic Exploration and Disambiguation Module (SESD) leveraging Retrieval-Augmented Generation (RAG) and CLIP scoring, and the Pixel Precision and Consistency Optimization Module (PPCO) for refining image details using Proximal Policy Optimization (PPO). A human-in-the-loop feedback mechanism further improves performance. Experiments show that VCA surpasses models like DALL-E 3 and Stable Diffusion, reducing dialogue rounds to 4.3, achieving a CLIP score of 0.92, and enhancing user satisfaction to 4.73/5. Additionally, we introduce a novel multi-round dialogue dataset with prompt-image pairs and user intent annotations.

Problem

Research questions and friction points this paper is trying to address.

Image Generation

User Intent Understanding

Dialogue Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Collaborative Adaptation (VCA)

Reinforcement Learning for Image Generation

User Feedback Mechanism

🔎 Similar Papers

No similar papers found.