AI Summary
Text-to-image generation often suffers from output misalignment due to ambiguous user prompts. To address this, we propose a dual-agent synchronous co-adaptive dialogue framework that models image generation as a dynamic, iterative human-AI collaborative optimization process: one agent performs conditional editing and latent-space feedback-driven fine-tuning, while the other models multi-turn dialogue semantics to resolve prompt ambiguity. Our approach is the first to enable joint co-adaptation of the generator and dialogue policy across both latent and semantic spaces, without requiring additional annotations or task-specific pretraining. Experiments demonstrate that our method significantly reduces user trial-and-error iterations (by 42% on average), improves intent alignment and visual fidelity, and achieves state-of-the-art performance on multiple human-AI collaborative image generation benchmarks.
Abstract
Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, iterative workflow where an intelligent dialogue agent continuously interacts with the user. Initially, a base image is generated from the user's prompt. Then, through a series of synchronized dialogue exchanges, the system adapts and optimizes the image according to evolving user feedback. The co-adaptive process allows the system to progressively narrow down ambiguities and better align with user intent. Experiments demonstrate that Twin-Co not only enhances user experience by reducing trial-and-error iterations but also improves the quality of the generated images, streamlining the creative process across various applications.
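The iterative workflow described above (generate a base image, collect feedback through dialogue, regenerate) can be illustrated with a minimal Python sketch. This is not Twin-Co's implementation: `generate`, `Session`, and `refine_loop` are hypothetical names, the generator is a string-returning stub, and folding feedback into the prompt text is an assumption that stands in for the latent-space fine-tuning and dialogue modeling the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Session:
    """State carried across dialogue turns (hypothetical structure)."""
    prompt: str
    image: str  # placeholder for real image data
    history: List[str] = field(default_factory=list)


def generate(prompt: str) -> str:
    """Stub standing in for a text-to-image model (assumption)."""
    return f"<image for: {prompt}>"


def refine_loop(
    prompt: str,
    get_feedback: Callable[[str], Optional[str]],
    max_turns: int = 5,
) -> Session:
    """Generate a base image, then refine it turn by turn.

    get_feedback receives the current image and returns the user's
    reply, or None once the user is satisfied.
    """
    session = Session(prompt=prompt, image=generate(prompt))
    for _ in range(max_turns):
        feedback = get_feedback(session.image)
        if feedback is None:  # user accepts the current image
            break
        session.history.append(feedback)
        # Assumption: feedback is simply appended to the prompt.
        session.prompt = f"{session.prompt}; {feedback}"
        session.image = generate(session.prompt)
    return session
```

In a real system, `get_feedback` would be driven by the dialogue agent asking clarifying questions, and `max_turns` bounds the number of trial-and-error iterations a user can spend on one image.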