AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-subject image customization must simultaneously satisfy text alignment, subject identity fidelity, and layout controllability—yet existing methods struggle to balance all three and rely on costly additional training, limiting scalability and efficiency. This paper introduces the first training-free framework for multi-subject customization, enabling joint generation solely from text prompts, subject reference images, and spatial layout constraints. Our core innovation is a bottom-up, two-level attention decoupling mechanism: a global layer ensures text–image semantic alignment, while a local layer independently models features within each subject region—marking the first separation in diffusion models between text–visual cross-attention and subject-specific attention. Furthermore, we adapt a pre-trained image adapter without fine-tuning to align subject features. Experiments demonstrate state-of-the-art performance, supporting complex compositions with over five subjects while significantly improving identity fidelity, layout compliance, and text consistency.

📝 Abstract
Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing multiple user-specified subjects into one coherent image
Balancing text alignment, subject identity preservation, and layout control
Avoiding costly additional training, which limits the scalability and efficiency of prior methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for layout-guided multi-subject customization
Bottom-up dual-level attention decoupling: global decoupling for text alignment, local decoupling for per-subject identity and layout control
Pre-trained image adapters reused without fine-tuning to extract subject-specific features
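The abstract describes the mechanism at a high level: a global cross-attention pass handles the text condition over the whole image, while each subject's cross-attention is confined to its layout region. The minimal sketch below illustrates that idea with NumPy; the function names (`attention`, `decoupled_attention`) and the additive way local outputs are injected are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention over flattened latent tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def decoupled_attention(latents, text_kv, subjects):
    """Hypothetical sketch of bottom-up dual-level decoupling.

    latents:  (N, d) flattened image tokens
    text_kv:  (k, v) projected from the text prompt (global level)
    subjects: list of (region_mask, (k, v)) pairs, one per subject,
              where region_mask is a boolean array over the N tokens
              derived from the layout constraint (local level)
    """
    t_k, t_v = text_kv
    out = attention(latents, t_k, t_v)           # global: text alignment everywhere
    for region_mask, (s_k, s_v) in subjects:     # local: one pass per subject
        local = attention(latents[region_mask], s_k, s_v)
        out[region_mask] += local                # subject features stay inside the box
    return out
```

Because each subject's keys and values are only visible inside its own mask, tokens outside every region are driven purely by the text condition, which is how the confinement prevents identity leakage between subjects.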
Binhe Yu
Zhejiang University

Zhen Wang
HKUST

Kexin Li
Zhejiang Tobacco Monopoly Administration

Yuqian Yuan
PhD student, Zhejiang University
Computer Vision, Machine Learning

Wenqiao Zhang
Zhejiang University

Long Chen
HKUST

Juncheng Li
East China Normal University
Super Resolution, Image Restoration, Computer Vision, Medical Image Analysis

Jun Xiao
Zhejiang University

Yueting Zhuang
Zhejiang University