T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image generation models exhibit high sensitivity to prompt formulations, and existing controllable methods typically require additional model training, limiting generalizability and usability. To address this, we propose the first training-free, multi-agent collaborative framework that integrates multimodal large language models (MLLMs) to enable prompt parsing, cross-model automatic scheduling, semantics-driven iterative refinement, and closed-loop quality assessment. The system supports both fully autonomous operation and human-in-the-loop intervention, substantially reducing user burden in prompt engineering while improving text-image alignment and generation fidelity. Evaluated on GenAI-Bench, our method achieves VQA scores competitive with RecraftV3 and Imagen 3, outperforming FLUX1.1-pro by 6.17% while incurring only 16.59% of its inference cost; it also surpasses FLUX.1-dev and SD 3.5 Large.

📝 Abstract
Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability or require additional training, restricting their generalizability. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to the commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%, respectively. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
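The three-agent closed loop described in the abstract (interpret → generate → evaluate → regenerate on low scores) can be sketched as follows. This is a minimal illustration, not the paper's actual API: the agent function names, the quality threshold, the retry budget, and the stubbed agent bodies are all assumptions; in the real system each agent wraps (M)LLM and T2I model calls.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Standardized prompt report, as produced by the Input Interpreter."""
    refined_prompt: str
    ambiguities: list = field(default_factory=list)


def input_interpreter(prompt: str) -> Report:
    # Stub: the real agent asks an MLLM to parse the prompt and
    # resolve ambiguities before generation starts.
    return Report(refined_prompt=prompt.strip())


def generation_engine(report: Report) -> str:
    # Stub: the real agent schedules one of several T2I models
    # (e.g. FLUX.1-dev vs. SD 3.5 Large) and returns an image.
    return f"<image for: {report.refined_prompt}>"


def quality_evaluator(image: str, report: Report) -> tuple[float, str]:
    # Stub: the real agent scores aesthetics and text-image
    # alignment with an MLLM and returns feedback text.
    return 1.0, "ok"


def t2i_copilot(prompt: str, threshold: float = 0.8, max_rounds: int = 3):
    """Closed-loop generation: interpret, then generate/evaluate until
    the score clears the threshold or the round budget runs out."""
    report = input_interpreter(prompt)
    for _ in range(max_rounds):
        image = generation_engine(report)
        score, feedback = quality_evaluator(image, report)
        if score >= threshold:
            return image, score
        # Fold evaluator feedback back into the report for regeneration;
        # a human could also edit the report here (human-in-the-loop).
        report.refined_prompt = f"{report.refined_prompt} ({feedback})"
    return image, score
```

Because every agent is a plain function over a shared `Report`, human-in-the-loop intervention amounts to letting a user inspect or edit the report between rounds, with no retraining of any model.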
Problem

Research questions and friction points this paper is trying to address.

Enhances prompt interpretation for text-to-image generation
Reduces need for manual prompt refinement
Improves text-image alignment without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multi-agent system for T2I generation
Automates prompt phrasing and model selection
Enhances text-image alignment and generation quality