🤖 AI Summary
Existing automated graphic design methods struggle to jointly control heterogeneous multi-source conditions—such as images, layouts, and text—resulting in weak cross-condition generalization, coarse-grained sub-condition control, and disharmonious composition. To address this, we propose a unified multi-condition-driven diffusion Transformer architecture, introducing the first cross-modal attention masking mechanism to enable region-level fine-grained condition disentanglement and alignment. We also construct the first large-scale, 400K-sample design dataset with multi-condition annotations and benchmarks, integrated with automated data synthesis and condition-embedding alignment techniques. Experiments demonstrate that our method achieves state-of-the-art performance in three critical dimensions: fidelity to user intent, compositional harmony, and local controllability—significantly outperforming existing multi-condition generation approaches.
📝 Abstract
Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition-driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
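To make the region-level masking idea concrete, the sketch below builds a boolean attention mask in which each image token may attend only to the tokens of the condition whose spatial region contains it, so conditions do not interfere with each other's regions. This is a minimal illustrative construction, not the paper's actual implementation: the function name, the box-in-grid-coordinates region format, and the fixed token count per condition are all assumptions.

```python
import numpy as np

def build_multimodal_attention_mask(grid_h, grid_w, cond_regions, tokens_per_cond):
    """Hypothetical sketch of a region-wise attention mask.

    cond_regions: list of (x0, y0, x1, y1) boxes in image-token grid
        coordinates (half-open), one box per condition.
    tokens_per_cond: number of tokens each condition contributes.
    Returns a boolean mask of shape
        (grid_h * grid_w, len(cond_regions) * tokens_per_cond),
    where True means the image token may attend to that condition token.
    """
    n_img = grid_h * grid_w
    n_cond = len(cond_regions)
    mask = np.zeros((n_img, n_cond * tokens_per_cond), dtype=bool)
    for c, (x0, y0, x1, y1) in enumerate(cond_regions):
        for y in range(y0, y1):
            for x in range(x0, x1):
                img_idx = y * grid_w + x  # row-major token index
                # Unmask only this condition's token slice for tokens
                # inside its designated region.
                mask[img_idx, c * tokens_per_cond:(c + 1) * tokens_per_cond] = True
    return mask

# Example: a 4x4 token grid with two conditions, one controlling the
# left half and one the right half, each contributing 2 tokens.
mask = build_multimodal_attention_mask(
    4, 4, [(0, 0, 2, 4), (2, 0, 4, 4)], tokens_per_cond=2
)
```

In a diffusion Transformer this mask would typically be passed to the joint attention layers (e.g., as an additive bias of `-inf` on masked positions), so gradients and attention weights for each condition stay confined to its region.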