ConsistCompose: Unified Multimodal Layout Control for Image Composition

📅 2025-11-23
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Existing unified multimodal models excel at visual understanding but remain limited in layout-controllable multi-instance image generation, particularly in achieving precise compositional control. This paper introduces ConsistCompose, built on a linguistic-embedded layout-grounded generation (LELG) paradigm: it encodes layout coordinates directly into language prompts and pairs them with coordinate-aware classifier-free guidance inside a multimodal large-model architecture, enabling interleaved text-image input and spatially accurate generation within a single unified interface—without task-specific branches—thus unifying multimodal understanding and generation. To support this, we construct the large-scale ConsistCompose3M dataset. Experiments demonstrate substantial improvements in spatial localization accuracy on COCO-Position and MS-Bench while maintaining high identity fidelity, and our method achieves state-of-the-art performance in both multi-instance image generation and multimodal understanding.
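
The page does not show the paper's exact prompt template. Below is a minimal sketch of what instance-coordinate binding could look like, assuming hypothetical `<obj>`/`<box>` tags and bounding boxes normalized to [0, 1]; the tag names and format are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of instance-coordinate binding: each instance's
# description is serialized together with its normalized bounding box,
# so the layout lives inside the language prompt itself.
# The <obj>/<box> tags are assumed, not the paper's actual format.

def bind_instances_to_prompt(scene: str, instances: list[dict]) -> str:
    """Build a layout-embedded prompt from a scene caption and instances.

    Each instance is a dict with a text description ("desc") and a
    bounding box "box" = (x1, y1, x2, y2) normalized to [0, 1].
    """
    parts = [scene]
    for inst in instances:
        x1, y1, x2, y2 = inst["box"]
        parts.append(
            f"<obj>{inst['desc']}</obj>"
            f"<box>({x1:.2f},{y1:.2f}),({x2:.2f},{y2:.2f})</box>"
        )
    return " ".join(parts)


prompt = bind_instances_to_prompt(
    "a city street at dusk",
    [
        {"desc": "a red vintage car", "box": (0.05, 0.55, 0.45, 0.95)},
        {"desc": "a cyclist in a yellow jacket", "box": (0.60, 0.50, 0.85, 0.90)},
    ],
)
print(prompt)
```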

📝 Abstract
Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, limiting precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text input within a single generative interface. We further construct ConsistCompose3M, a 3.4M-sample multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
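
The coordinate-aware classifier-free guidance rule is not spelled out on this page. The sketch below shows one common way such guidance is realized: standard text guidance plus a separate term that pushes the sample toward the layout condition. The `denoiser` call, the three-branch decomposition, and the weights `w_text`/`w_layout` are assumptions, not the paper's exact formulation.

```python
import torch

def coordinate_aware_cfg(denoiser, x_t, t, text_emb, layout_emb,
                         null_text, null_layout,
                         w_text: float = 5.0, w_layout: float = 2.0):
    """One guided noise prediction with a separate weight for layout.

    `denoiser(x_t, t, text, layout)` stands in for the unified model's
    noise-prediction call; the decomposition below is one common way to
    make guidance coordinate-aware, not the paper's exact rule.
    """
    eps_uncond = denoiser(x_t, t, null_text, null_layout)  # both conditions dropped
    eps_text = denoiser(x_t, t, text_emb, null_layout)     # text only
    eps_full = denoiser(x_t, t, text_emb, layout_emb)      # text + coordinates

    # Standard CFG on the text condition, plus an extra push toward
    # satisfying the layout coordinates embedded in the prompt.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_layout * (eps_full - eps_text))

# Toy call with a dummy denoiser, just to show the call shape.
dummy = lambda x, t, txt, lay: torch.zeros_like(x)
x = torch.randn(1, 4, 64, 64)
eps = coordinate_aware_cfg(dummy, x, t=0, text_emb=None, layout_emb=None,
                           null_text=None, null_layout=None)
```
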
Problem

Research questions and friction points this paper is trying to address.

Enabling layout-controlled multi-instance image generation from multimodal inputs
Translating linguistic layout cues into precise spatial control mechanisms
Improving spatial accuracy while preserving identity fidelity in generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds layout coordinates into language prompts
Uses instance-coordinate binding for spatial control
Constructs the large-scale ConsistCompose3M dataset for layout-conditioned generation (a possible record schema is sketched after this list)
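
The dataset's record schema is not given on this page. The dataclasses below are a hypothetical illustration of layout-plus-identity supervision covering both pair types from the abstract: text-guided instances carry no reference image, while image-guided ones bind to an identity reference. All field names are assumptions.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Instance:
    desc: str                               # instance description
    box: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    identity_ref: str | None = None         # reference image path, if image-guided

@dataclass
class Sample:
    caption: str               # scene-level caption
    image_path: str            # target image
    instances: list[Instance] = field(default_factory=list)

    @property
    def is_image_guided(self) -> bool:
        # Image-guided pairs bind at least one instance to a reference identity.
        return any(i.identity_ref is not None for i in self.instances)
```
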
🔎 Similar Papers
No similar papers found.
👥 Authors
Xuanke Shi (SenseTime Research)
Boxuan Li (Microsoft; topics: Big Data, LLM, agent)
Xiaoyang Han (SenseTime Research)
Zhongang Cai (SenseTime Research)
Lei Yang (SenseTime Research)
Dahua Lin (The Chinese University of Hong Kong; topics: computer vision, machine learning, probabilistic inference, Bayesian nonparametrics)
Quan Wang (SenseTime Research)