🤖 AI Summary
To address the poor spatial localization and limited scalability of text-guided diffusion models in multi-instance complex scene generation, this paper proposes a collaborative framework built on Janus-Pro and MIGLoRA. Janus-Pro, a compact 1B-parameter model, serves as the prompt-to-layout parsing module that enables high-fidelity layout control, while MIGLoRA is a plug-and-play adapter that unifies LoRA fine-tuning across two distinct diffusion backbones, SD1.5 (UNet-based) and SD3 (DiT-based), with zero architectural modification. For evaluation, the paper establishes the DescripBox benchmark suite; the method achieves state-of-the-art performance on COCO and LVIS, with significant improvements in layout fidelity and generation diversity, superior parameter efficiency over existing methods, and support for open-world image synthesis at 1024×1024 resolution.
📄 Abstract
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet these models struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a module that bridges text understanding and layout generation by parsing prompts into layouts with a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in that integrates Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA preserves the base model's parameters and remains plug-and-play, minimizing architectural intrusion while enabling efficient fine-tuning. To support comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
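The parameter-efficiency claim rests on the standard LoRA mechanism: the frozen base weight W is left untouched, and only a low-rank residual B·A is trained, so the adapter can be attached to (or detached from) a backbone layer without modifying it. The following is a minimal NumPy sketch of that idea, not the paper's actual MIGLoRA implementation; the names `rank` and `alpha` follow LoRA convention and are assumptions here.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA adapter around a frozen linear weight.

    Output = x @ W.T + (alpha / rank) * x @ A.T @ B.T
    Only A and B would be trained; W stays frozen (plug-and-play).
    """

    def __init__(self, w, rank=4, alpha=4, seed=0):
        self.w = w  # frozen base weight, shape (out_dim, in_dim)
        rng = np.random.default_rng(seed)
        out_dim, in_dim = w.shape
        # Down-projection A: small random init; up-projection B: zeros,
        # so at initialization the adapter is an exact no-op.
        self.a = rng.normal(0.0, 0.02, size=(rank, in_dim))
        self.b = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus scaled low-rank residual.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T
```

Because B is zero-initialized, the adapted layer reproduces the base model exactly before fine-tuning, which is what makes this kind of adapter safe to bolt onto both UNet and DiT attention layers without architectural changes.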