🤖 AI Summary
In text-to-image (T2I) generation, underspecified user prompts cause misalignment between user intent and model output, forcing users into iterative prompt refinement. To address this, the authors propose proactive multi-turn T2I agents that model latent user intent as an *editable belief graph* and *actively ask clarification questions* to align with the user. The contributions are threefold: (1) a proactive agent design that combines active clarification with an interpretable, user-editable belief graph; (2) DesignBench, a multi-turn T2I benchmark tailored to artist and designer workflows; and (3) a scalable two-agent automated evaluation framework in which one agent holds a ground-truth image and the other asks as few questions as possible to align with it. Across DesignBench, COCO, and ImageInWords, these agents achieve at least 2× higher VQAScore (an existing VQA-based alignment metric; Lin et al., 2024) than standard single-turn generation, and at least 90% of human subjects found the agents and their belief graphs helpful for their T2I workflow, demonstrating the practicality of active clarification and interpretable intent modeling.
📝 Abstract
User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents, one holding a ground-truth image and the other trying to ask as few questions as possible to align with it. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
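To make the two-agent evaluation loop concrete, here is a minimal Python sketch of the idea described above: one agent maintains an editable belief graph over user intent and asks clarification questions, while an oracle agent answers from a hidden ground-truth image. All class and method names (`BeliefGraph`, `dual_agent_eval`, `next_question`, `incorporate`, etc.) are illustrative placeholders, not the authors' actual implementation, and the final VQAScore scoring step is only indicated in a comment.

```python
# Hypothetical sketch of the paper's two-agent automated evaluation loop.
# Names and interfaces are illustrative assumptions, not the released code.
from dataclasses import dataclass, field


@dataclass
class BeliefGraph:
    """Editable belief over user intent: entity -> {attribute: value}."""
    entities: dict = field(default_factory=dict)

    def update(self, entity: str, attribute: str, value: str) -> None:
        self.entities.setdefault(entity, {})[attribute] = value

    def to_prompt(self) -> str:
        """Flatten the graph into a T2I prompt string."""
        parts = []
        for entity, attrs in self.entities.items():
            desc = ", ".join(f"{k}: {v}" for k, v in attrs.items())
            parts.append(f"{entity} ({desc})" if desc else entity)
        return "; ".join(parts)


def dual_agent_eval(asker, oracle, initial_prompt: str, max_turns: int = 5):
    """Asker refines its belief graph by querying the oracle, which answers
    from a hidden ground-truth image; fewer turns to alignment is better.
    The refined prompt would then be rendered by a T2I model and scored
    against the ground truth with VQAScore (Lin et al., 2024)."""
    graph = BeliefGraph()
    asker.initialize(graph, initial_prompt)
    turns_used = 0
    for _ in range(max_turns):
        question = asker.next_question(graph)
        if question is None:  # asker is confident enough to stop asking
            break
        answer = oracle.answer(question)  # grounded in the hidden image
        asker.incorporate(graph, question, answer)
        turns_used += 1
    return graph.to_prompt(), turns_used
```

In the paper's setup the asker and oracle would be LLM-backed agents; here they are abstract so the control flow stays visible.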