🤖 AI Summary
In text-to-image (T2I) generation, underspecified user prompts cause misalignment between user intent and model output, forcing users into iterative prompt refinement. To address this, the authors propose proactive multi-turn T2I agents that model latent user intent as an *editable belief graph* and *actively ask clarification questions* to align with the user. The contributions are threefold: (1) a proactive agent design that combines active clarification with an interpretable, user-editable belief graph; (2) DesignBench, a multi-turn T2I benchmark tailored to artist and designer workflows; and (3) a scalable two-agent automated evaluation framework in which one agent holds a ground-truth image and the other asks as few questions as possible to align with it. Across DesignBench, COCO, and ImageInWords, these agents achieve at least 2× higher VQAScore (an existing VQA-based alignment metric; Lin et al., 2024) than standard single-turn generation, and at least 90% of human subjects found the agents and their belief graphs helpful for their T2I workflow, demonstrating the practicality of active clarification and interpretable intent modeling.
📝 Abstract
User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents, one holding a ground-truth image and the other trying to ask as few questions as possible to align with it. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
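To make the two-agent evaluation loop concrete, here is a minimal Python sketch of the idea described above: one agent maintains an editable belief graph over user intent and asks clarification questions, while an oracle agent answers from a hidden ground-truth image. All class and method names (`BeliefGraph`, `dual_agent_eval`, `next_question`, `incorporate`, etc.) are illustrative placeholders, not the authors' actual implementation, and the final VQAScore scoring step is only indicated in a comment.

```python
# Hypothetical sketch of the paper's two-agent automated evaluation loop.
# Names and interfaces are illustrative assumptions, not the released code.
from dataclasses import dataclass, field


@dataclass
class BeliefGraph:
    """Editable belief over user intent: entity -> {attribute: value}."""
    entities: dict = field(default_factory=dict)

    def update(self, entity: str, attribute: str, value: str) -> None:
        self.entities.setdefault(entity, {})[attribute] = value

    def to_prompt(self) -> str:
        """Flatten the graph into a T2I prompt string."""
        parts = []
        for entity, attrs in self.entities.items():
            desc = ", ".join(f"{k}: {v}" for k, v in attrs.items())
            parts.append(f"{entity} ({desc})" if desc else entity)
        return "; ".join(parts)


def dual_agent_eval(asker, oracle, initial_prompt: str, max_turns: int = 5):
    """Asker refines its belief graph by querying the oracle, which answers
    from a hidden ground-truth image; fewer turns to alignment is better.
    The refined prompt would then be rendered by a T2I model and scored
    against the ground truth with VQAScore (Lin et al., 2024)."""
    graph = BeliefGraph()
    asker.initialize(graph, initial_prompt)
    turns_used = 0
    for _ in range(max_turns):
        question = asker.next_question(graph)
        if question is None:  # asker is confident enough to stop asking
            break
        answer = oracle.answer(question)  # grounded in the hidden image
        asker.incorporate(graph, question, answer)
        turns_used += 1
    return graph.to_prompt(), turns_used
```

In the paper's setup the asker and oracle would be LLM-backed agents; here they are abstract so the control flow stays visible.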