Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing

📅 2025-03-27
🤖 AI Summary
To address the poor spatial localization and limited scalability of text-guided diffusion models in complex multi-instance scene generation, this paper proposes a framework that combines two modules. The first is Janus-Pro-driven Prompt Parsing, a lightweight (1B-parameter) prompt-to-layout parsing module for layout control. The second is MIGLoRA, a plug-and-play adapter that applies LoRA fine-tuning to two distinct diffusion backbones, SD1.5 (UNet-based) and SD3 (DiT-based), while preserving the base model's parameters and minimizing architectural changes. The authors also create the DescripBox and DescripBox-1024 benchmarks and report state-of-the-art performance on COCO and LVIS, with improved layout fidelity and generation diversity, better parameter efficiency than existing methods, and support for open-world image synthesis at 1024×1024 resolution.

๐Ÿ“ Abstract
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model's parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
Problem

Research questions and friction points this paper is trying to address.

Improving multi-object scene synthesis in diffusion models
Enhancing text-layout alignment for complex image generation
Enabling parameter-efficient adaptation for diverse resolution outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Janus-Pro-driven Prompt Parsing for layout generation
MIGLoRA integrates LoRA into UNet and DiT backbones
Parameter-efficient plug-and-play adaptability with MIGLoRA
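
To make the "preserve the base model's parameters" idea concrete, here is a minimal sketch of a generic LoRA-style linear layer in plain Python. It is not the authors' MIGLoRA implementation: the class name LoRALinear, the helper matmul, and the default values for rank and alpha are all hypothetical and chosen only for illustration. It assumes the standard LoRA formulation, in which a frozen weight W is augmented by a trainable low-rank product B @ A scaled by alpha / rank.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical LoRA-style layer: frozen W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weight, never modified here
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random; B starts at zero, so the adapter
        # initially leaves the base layer's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        """Apply the layer to a vector x (list of length in_dim)."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        low_rank = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * l[0] for b, l in zip(base, low_rank)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B), not the frozen weight."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([1.0, 0.0, -1.0]))  # [-2.0, -2.0] while B is still zero
print(layer.trainable_params())         # 5 adapter parameters vs. 6 in W
```

Because B is initialized to zero, attaching the adapter does not change the base layer's output until training updates A and B, and removing the adapter restores the original model exactly. The parameter savings are negligible in this toy example but grow with layer size, since the adapter holds rank × (in_dim + out_dim) values instead of in_dim × out_dim.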