🤖 AI Summary
To address the poor spatial localization and limited scalability of text-guided diffusion models in multi-instance complex scene generation, this paper proposes a collaborative framework built on Janus-Pro and MIGLoRA. Janus-Pro, a compact 1B-parameter model, serves as the prompt-to-layout parsing module that enables high-fidelity layout control, while MIGLoRA is a plug-and-play adapter that unifies LoRA fine-tuning across two distinct diffusion backbones, SD1.5 (UNet-based) and SD3 (DiT-based), with zero architectural modification. For evaluation, the paper establishes the DescripBox benchmark suite; the method achieves state-of-the-art performance on COCO and LVIS, with significant improvements in layout fidelity and generation diversity, superior parameter efficiency over existing methods, and support for open-world image synthesis at 1024×1024 resolution.
📄 Abstract
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet these models struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a module that bridges text understanding and layout generation by parsing prompts into layouts with a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in that integrates Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA preserves the base model's parameters and remains plug-and-play, minimizing architectural intrusion while enabling efficient fine-tuning. To support comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
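The parameter-efficiency claim rests on the standard LoRA mechanism: the frozen base weight W is left untouched, and only a low-rank residual B·A is trained, so the adapter can be attached to (or detached from) a backbone layer without modifying it. The following is a minimal NumPy sketch of that idea, not the paper's actual MIGLoRA implementation; the names `rank` and `alpha` follow LoRA convention and are assumptions here.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA adapter around a frozen linear weight.

    Output = x @ W.T + (alpha / rank) * x @ A.T @ B.T
    Only A and B would be trained; W stays frozen (plug-and-play).
    """

    def __init__(self, w, rank=4, alpha=4, seed=0):
        self.w = w  # frozen base weight, shape (out_dim, in_dim)
        rng = np.random.default_rng(seed)
        out_dim, in_dim = w.shape
        # Down-projection A: small random init; up-projection B: zeros,
        # so at initialization the adapter is an exact no-op.
        self.a = rng.normal(0.0, 0.02, size=(rank, in_dim))
        self.b = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus scaled low-rank residual.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T
```

Because B is zero-initialized, the adapted layer reproduces the base model exactly before fine-tuning, which is what makes this kind of adapter safe to bolt onto both UNet and DiT attention layers without architectural changes.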