UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

📅 2025-03-12
🤖 AI Summary
To address the consistency challenge in multimodal-conditioned image generation, where text prompts, spatial maps, and subject images jointly guide synthesis, this paper proposes UniCombine, a unified controllable generation framework built on the DiT architecture. Methodologically, it introduces (1) Conditional MMDiT Attention, a mechanism that aligns and fuses heterogeneous condition tokens within joint attention; (2) a trainable LoRA module that yields both a training-free and a training-based version of the framework, supporting zero-shot use as well as fine-tuned adaptation; and (3) SubjectSpatial200K, the first large-scale dataset covering both subject-driven and spatially-aligned multimodal conditioning, built with a newly proposed construction pipeline. Across diverse multi-conditional generation tasks, the framework achieves state-of-the-art performance, improving cross-modal consistency and generation versatility.
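The summary does not specify how Conditional MMDiT Attention works internally, only its role of fusing heterogeneous condition tokens within the DiT backbone. As a rough illustration of the general idea, the sketch below performs MMDiT-style joint attention over concatenated latent, text, and condition token streams, with a per-modality QKV projection. All names here (`ConditionalJointAttention`, `cond_qkv`, and so on) are hypothetical and do not reflect the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalJointAttention(nn.Module):
    """Illustrative sketch: joint attention over latent, text, and an
    arbitrary set of condition token streams (names are hypothetical)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One QKV projection per stream, following the MMDiT pattern of
        # modality-specific weights feeding a shared attention operation.
        self.latent_qkv = nn.Linear(dim, dim * 3)
        self.text_qkv = nn.Linear(dim, dim * 3)
        self.cond_qkv = nn.Linear(dim, dim * 3)  # shared across conditions here
        self.out = nn.Linear(dim, dim)

    def forward(self, latent, text, conditions):
        # latent: (B, N_l, D); text: (B, N_t, D); conditions: list of (B, N_c, D)
        streams = [self.latent_qkv(latent), self.text_qkv(text)]
        streams += [self.cond_qkv(c) for c in conditions]
        q, k, v = torch.cat(streams, dim=1).chunk(3, dim=-1)

        def split_heads(x):
            B, N, _ = x.shape
            return x.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v)
        )
        attn = attn.transpose(1, 2).reshape(
            latent.size(0), -1, self.num_heads * self.head_dim
        )
        # Return only the updated latent tokens; per-stream outputs omitted.
        return self.out(attn[:, : latent.size(1)])
```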

📝 Abstract
With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance.
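The abstract states that incorporating a trainable LoRA module turns the training-free version into a training-based one. Below is a minimal, generic low-rank adaptation wrapper of the kind commonly used to adapt frozen attention projections; it sketches the standard LoRA technique, not UniCombine's specific module, and the attribute it is applied to is taken from the illustrative attention sketch above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: a frozen base projection plus a trainable
    low-rank update, initialized so training starts from the base model."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # zero update at initialization
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Illustrative use: adapt only the condition-branch projection, so the
# same backbone can serve a training-free and a training-based mode.
# attn.cond_qkv = LoRALinear(attn.cond_qkv, rank=16)
```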
Problem

Research questions and friction points this paper is trying to address.

How to effectively combine multiple conditional inputs in image generation.
How to maintain consistency across diverse conditions such as text prompts, spatial maps, and subject images.
How to build a unified framework covering multi-conditional generative tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based multi-conditional generative framework handling arbitrary combinations of conditions (see the usage sketch after this list)
Conditional MMDiT Attention mechanism
SubjectSpatial200K dataset for multi-conditional tasks
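To make the "any combination of conditions" claim concrete, here is a hypothetical usage of the `ConditionalJointAttention` sketch from above: any subset of condition token streams can be passed in, since the joint attention is agnostic to how many streams it receives. Shapes, token counts, and condition names are illustrative only.

```python
import torch

torch.manual_seed(0)
D = 64
attn = ConditionalJointAttention(dim=D, num_heads=4)  # sketch defined earlier

latent = torch.randn(1, 256, D)  # noisy image tokens
text = torch.randn(1, 77, D)     # text-prompt tokens
conditions = {
    "canny": torch.randn(1, 256, D),    # spatially-aligned condition tokens
    "subject": torch.randn(1, 256, D),  # subject-image condition tokens
}
# Any subset of conditions can be selected and combined.
out = attn(latent, text, list(conditions.values()))
print(out.shape)  # torch.Size([1, 256, 64])
```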