🤖 AI Summary
Addressing the challenge of jointly preserving layout awareness and content consistency in multi-reference controllable image synthesis, this paper introduces LAMIC, the first training-free, zero-shot framework for multi-reference diffusion-based composition. Methodologically, it proposes two plug-and-play attention mechanisms, Group Isolation Attention and Region-Modulated Attention, integrated into the MMDiT architecture to enable entity disentanglement and region-level layout control. It also introduces three evaluation metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control, and Background Similarity (BG-S) for measuring background consistency. Experiments demonstrate state-of-the-art performance on most major metrics: LAMIC consistently outperforms existing multi-reference baselines in ID-S (identity similarity), BG-S, IN-R, and average (AVG) scores across all settings, and achieves the best DPG score on complex composition tasks. The framework improves identity preservation, background consistency, layout control, and prompt adherence without any training or fine-tuning, establishing a new paradigm for controllable multi-image composition.
📝 Abstract
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
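To give a flavor of how an attention mask can isolate reference entities, here is a minimal NumPy sketch of masked scaled dot-product attention in which tokens attend only within their own reference group. This is an illustrative guess at the idea behind Group Isolation Attention, not the paper's actual implementation; all names and shapes are made up for the example.

```python
import numpy as np

def group_isolation_mask(group_ids):
    """Boolean mask allowing attention only between tokens that share a
    reference-group id (illustrative sketch, not the paper's code)."""
    g = np.asarray(group_ids)
    return g[:, None] == g[None, :]

def masked_attention(q, k, v, mask):
    """Plain scaled dot-product attention with a boolean mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block cross-group attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Three tokens from reference A (group 0), two from reference B (group 1)
group_ids = [0, 0, 0, 1, 1]
mask = group_isolation_mask(group_ids)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
out = masked_attention(q, k, v, mask)
```

Because cross-group scores are suppressed, the outputs for group-0 tokens match attention computed over group 0 alone, so entities from different references cannot blend at this layer. A region-modulated variant could analogously reweight scores with a spatial layout mask instead of a group mask.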