LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of jointly preserving layout awareness and content consistency in multi-reference controllable image synthesis, this paper introduces the first training-free, zero-shot framework for multi-reference diffusion-based synthesis. Methodologically, it proposes two plug-and-play attention mechanisms, Group Isolation Attention and Region-Modulated Attention, integrated into the MMDiT architecture to enable entity disentanglement and region-level layout control. It also introduces three evaluation metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control, and Background Similarity (BG-S) for measuring background consistency. Experiments demonstrate state-of-the-art performance across most major metrics, including identity similarity (ID-S), BG-S, IN-R, and the average (AVG) score, with the best DPG score on complex composition tasks. The framework markedly improves identity preservation, background consistency, and prompt adherence, establishing a new paradigm for training-free multi-image composition.

📝 Abstract
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
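The abstract does not spell out how Group Isolation Attention works internally. One natural reading is a block-diagonal attention mask that prevents tokens belonging to different reference entities from attending to one another, which would yield the entity disentanglement described above. The sketch below is a minimal NumPy illustration under that assumption; the token-group sizes and mask layout are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def group_isolation_mask(group_sizes):
    """Build a boolean attention mask where tokens may only attend
    within their own reference-image group (block-diagonal).

    `group_sizes` is a hypothetical list of token counts per reference;
    this is an illustrative reading of Group Isolation Attention,
    not the paper's implementation.
    """
    n = sum(group_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in group_sizes:
        # Allow attention only inside each group's diagonal block.
        mask[start:start + size, start:start + size] = True
        start += size
    return mask

# Three toy references with 2, 3, and 2 tokens each.
mask = group_isolation_mask([2, 3, 2])
print(mask.shape)  # (7, 7)
```

In practice such a mask would be passed to the attention layers of the MMDiT backbone so that each reference's keys and values stay isolated from the others during denoising.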
Problem

Research questions and friction points this paper is trying to address.

Generating coherent multi-reference images with layout awareness
Extending single-reference diffusion models without training
Enhancing layout control and background consistency in synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends diffusion models to multi-reference scenarios
Introduces Group Isolation Attention for entity disentanglement
Utilizes Region-Modulated Attention for layout-aware generation
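The Inclusion Ratio (IN-R) metric named in the abstract can be understood as the fraction of a generated instance's pixels that fall inside its target layout region. The exact formula is not given here, so the sketch below is an assumed reading, with a toy example using boolean masks.

```python
import numpy as np

def inclusion_ratio(instance_mask, region_mask):
    """Fraction of the generated instance lying inside its target region.

    Both inputs are boolean H x W masks. This formula is an assumed
    reading of IN-R, not the paper's official definition.
    """
    instance_area = instance_mask.sum()
    if instance_area == 0:
        return 0.0
    return float((instance_mask & region_mask).sum() / instance_area)

# Toy 4x4 example: a 4-pixel instance, half of it inside the region.
instance = np.zeros((4, 4), dtype=bool)
instance[1:3, 1:3] = True           # generated instance
region = np.zeros((4, 4), dtype=bool)
region[0:3, 0:2] = True             # target layout region
score = inclusion_ratio(instance, region)
print(score)  # 0.5
```

A score of 1.0 would mean the instance is entirely contained in its assigned region, i.e. perfect layout adherence for that entity.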
Authors

Yuzhuo Chen (University of Science and Technology of China)
Zehua Ma (University of Science and Technology of China)
Jianhua Wang (Onestory Team)
Kai Kang (Apple)
Shunyu Yao (Onestory Team)
Weiming Zhang (University of Science and Technology of China)