MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

📅 2025-05-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion models struggle with fidelity and relational reasoning when generating images from complex text prompts involving multiple objects, attributes, and spatial/logical relationships. Method: We propose a training-free, multi-agent collaborative diffusion framework. First, we introduce an MLLM-driven multi-agent scene parsing module that performs fine-grained disentanglement of objects, attributes, and relationships. Second, we design a hierarchical diffusion mechanism leveraging Gaussian spatial masks and region-wise filtering to inject structured priors directly into pre-trained diffusion models, without fine-tuning. Results: Our method achieves significant improvements over state-of-the-art approaches across multiple complex-prompt benchmarks. It consistently enhances object accuracy, relational plausibility, and fine-grained image fidelity. By decoupling semantic understanding from generation and avoiding model adaptation, our framework establishes a new paradigm for open-domain, high-fidelity, controllable image synthesis.
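The Gaussian spatial masks and region blending mentioned above can be sketched as follows. This is an illustrative sketch only: the function names, the `sigma_scale` parameter, and the blending step are assumptions, not the paper's exact formulation.

```python
import numpy as np

def gaussian_box_mask(h, w, box, sigma_scale=0.5):
    """Build a soft spatial mask for a bounding box (x0, y0, x1, y1).

    The mask peaks at the box center and decays with a Gaussian whose
    spread is proportional to the box size, so a prior injected through
    it blends smoothly into the surrounding region.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return mask / mask.max()  # normalize so the peak is 1

def inject_region(base, region, mask):
    """Blend a region-specific latent into a base latent using the mask."""
    return base * (1.0 - mask[..., None]) + region * mask[..., None]
```

In this sketch the mask acts as a per-pixel interpolation weight, which is one common way such structured priors are injected into pre-trained models without any fine-tuning.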

📝 Abstract
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that builds an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract the various scene elements effectively. In addition, Hierarchical Compositional Diffusion utilizes Gaussian masks and filtering to refine bounding-box regions and enhance objects through region enhancement, resulting in accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that MCCD significantly improves the performance of baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
Problem

Research questions and friction points this paper is trying to address.

Handling complex prompts with multiple objects and relations
Improving accuracy and fidelity in complex scene generation
Enhancing text-to-image diffusion models without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent collaboration for scene parsing
Hierarchical Compositional Diffusion with Gaussian masks
Training-free performance enhancement for complex scenes
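To make the scene-parsing idea concrete, here is a hypothetical example of the kind of structured output an MLLM-driven parsing stage might emit, plus a simple check of a spatial relation against the parsed layout. The schema, field names, and `check_relation` helper are illustrative assumptions, not the paper's actual output format.

```python
# Hypothetical structured parse of a complex prompt: objects with
# attributes and normalized bounding boxes (x0, y0, x1, y1), plus relations.
scene = {
    "objects": [
        {"name": "cat", "attributes": ["orange"], "box": [0.05, 0.40, 0.45, 0.95]},
        {"name": "dog", "attributes": ["brown"], "box": [0.55, 0.35, 0.95, 0.95]},
    ],
    "relations": [
        {"subject": "cat", "predicate": "left of", "object": "dog"},
    ],
}

def check_relation(scene, rel):
    """Verify a spatial relation against the parsed layout (normalized boxes)."""
    boxes = {o["name"]: o["box"] for o in scene["objects"]}
    s, o = boxes[rel["subject"]], boxes[rel["object"]]
    if rel["predicate"] == "left of":
        # Subject's right edge must not pass the object's left edge.
        return s[2] <= o[0]
    raise ValueError("unsupported predicate: " + rel["predicate"])
```

A layout like this is what lets the downstream diffusion stage place each object in its own region instead of relying on a single global prompt.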
Mingcheng Li
Fudan University
Xiaolu Hou
Faculty of Informatics and Information Technologies, Slovak University of Technology, Slovakia
Cryptography, Hardware Security, AI Security
Ziyang Liu
Research Fellow, Harvard Medical School; PhD, Tsinghua University
AI4Bio, Graph Embedding, Large Language Model
Dingkang Yang
ByteDance
Multimodal Learning, Generative AI, Embodied AI
Ziyun Qian
Academy for Engineering and Technology, Fudan University; Cognition and Intelligent Technology Laboratory (CIT Lab)
Jiawei Chen
Academy for Engineering and Technology, Fudan University; Cognition and Intelligent Technology Laboratory (CIT Lab)
Jinjie Wei
Fudan University
Large Language Model
Yue Jiang
Academy for Engineering and Technology, Fudan University; Cognition and Intelligent Technology Laboratory (CIT Lab)
Qingyao Xu
Academy for Engineering and Technology, Fudan University; Cognition and Intelligent Technology Laboratory (CIT Lab)
Lihua Zhang
Wuhan University
Computational Biology, Bioinformatics, Data Mining