🤖 AI Summary
Problem: Existing diffusion models struggle with fidelity and relational reasoning when generating images from complex text prompts involving multiple objects, attributes, and spatial/logical relationships.
Method: We propose a training-free, multi-agent collaborative diffusion framework. First, we introduce an MLLM-driven multi-agent scene parsing module that performs fine-grained disentanglement of objects, attributes, and relationships. Second, we design a hierarchical diffusion mechanism that leverages Gaussian spatial masks and region-wise filtering to inject structured priors directly into pre-trained diffusion models, without any fine-tuning.
Results: Our method achieves significant improvements over state-of-the-art approaches across multiple complex-prompt benchmarks, consistently enhancing object accuracy, relational plausibility, and fine-grained image fidelity. By decoupling semantic understanding from generation and avoiding model adaptation, our framework establishes a new paradigm for open-domain, high-fidelity, controllable image synthesis.
📝 Abstract
Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often hit performance bottlenecks when handling complex prompts that involve multiple objects, attributes, and relations. We therefore propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that builds an agent system of multiple agents with distinct tasks, utilizing MLLMs to effectively extract the various scene elements. In addition, a hierarchical compositional diffusion mechanism uses Gaussian masks and filtering to refine bounding-box regions and enhance objects through region enhancement, resulting in accurate, high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that MCCD significantly improves the performance of baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
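To make the hierarchical mechanism concrete, the sketch below shows one plausible way to build a Gaussian spatial mask over an object's bounding box and use it to softly blend a region-specific latent into a global latent. This is an illustrative assumption, not the paper's implementation: the function names (`gaussian_box_mask`, `blend`), the sigma scaling, and the latent shapes are all hypothetical.

```python
# Hedged sketch: Gaussian spatial mask for a bounding-box region, as a
# hierarchical compositional diffusion step might use it. All names and
# parameters here are illustrative assumptions, not the paper's code.
import numpy as np

def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Return an (H, W) mask peaking at the center of box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Width/height of the box set the spread of the Gaussian.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return mask / mask.max()  # normalize so the peak is 1.0

def blend(global_latent, region_latent, mask):
    """Softly inject a region latent (C, H, W) into the global latent."""
    return mask[None] * region_latent + (1.0 - mask[None]) * global_latent

mask = gaussian_box_mask(64, 64, (16, 16, 48, 48))
print(mask.shape)  # (64, 64)
```

The soft (Gaussian) falloff, rather than a hard binary box, is what lets adjacent regions compose without visible seams; region-wise filtering would then decide which objects' masks are applied at each level of the hierarchy.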