M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing text-to-image generation models struggle to accurately interpret complex prompts containing multiple constraints. To this end, the authors propose the M3 framework, which introduces multi-agent collaborative reasoning into this task for the first time. M3 orchestrates an iterative, multimodal refinement process through five specialized modules—Planner, Checker, Refiner, Editor, and Verifier—enabling checklist-driven constraint verification and progressive correction of generated outputs without requiring model retraining. By integrating only off-the-shelf foundation models, M3 significantly enhances the performance of open-source systems, achieving a state-of-the-art score of 0.532 on the OneIG-EN benchmark—surpassing Imagen4 and Seedream 3.0—and nearly doubling performance on the GenEval spatial reasoning metric.

📝 Abstract
Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce **M3 (Multi-Modal, Multi-Agent, Multi-Round)**, a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
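The checklist-driven loop described above can be sketched in a few lines. The five role names (Planner, Checker, Refiner, Editor, Verifier) come from the paper; everything else here is an assumption — in the real system each role is played by an off-the-shelf foundation model, while the stubs below just operate on strings to make the control flow concrete.

```python
# Hypothetical sketch of the M3 refinement loop. All function bodies are
# stand-ins: real agents would call vision-language and image-editing models.

def planner(prompt):
    """Decompose the prompt into a checklist of verifiable constraints (stub)."""
    return [c.strip() for c in prompt.split(",") if c.strip()]

def checker(image, constraint):
    """Report whether the image satisfies one constraint (stub)."""
    return constraint in image

def refiner(constraint):
    """Turn a failed constraint into a targeted edit instruction (stub)."""
    return "add " + constraint

def editor(image, instruction):
    """Apply one surgical edit to the image (stub)."""
    return image + " " + instruction.removeprefix("add ")

def verifier(score_before, score_after):
    """Accept an edit only if the checklist score does not decrease."""
    return score_after >= score_before

def m3_refine(prompt, image, max_rounds=5):
    checklist = planner(prompt)
    score = lambda img: sum(checker(img, c) for c in checklist)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not checker(image, c)]
        if not failed:
            break  # every constraint verified; stop early
        # Correct constraints one at a time, as the abstract describes.
        candidate = editor(image, refiner(failed[0]))
        if verifier(score(image), score(candidate)):
            image = candidate  # keep only monotonic improvements
    return image
```

For example, `m3_refine("red cube, blue sphere", "a scene with red cube")` detects the missing "blue sphere" constraint, edits it in, and terminates once the checklist passes. The key design choice the paper highlights is that no model is retrained: all improvement comes from this inference-time check-edit-verify cycle.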
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
compositional prompts
multi-constraint reasoning
visual fidelity
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reasoning
compositional text-to-image generation
training-free refinement
iterative visual reasoning
foundation model orchestration