🤖 AI Summary
Diffusion models suffer from text style distortion and low recognition accuracy, particularly for small-font and multilingual (especially Chinese) text rendering. To address this, we propose a high-fidelity text generation method tailored for AI-assisted graphic design. Methodologically: (i) we introduce the first end-to-end text style transfer model built on the DiT architecture, which generates transparent RGBA text foregrounds that preserve the style of reference glyphs; (ii) we construct the first bilingual (Chinese–English) synthetic text-image dataset; and (iii) we design a multi-modal condition encoder that enables text synthesis conditioned on background images, prompts, and layout specifications, and combine it with a pre-trained text-to-image (T2I) model and an MLLM-based layout planner to form a background-aware, fully automated Text-to-Design (T2D) pipeline. Experiments demonstrate that our approach achieves state-of-the-art performance among open-source methods in both text accuracy and style consistency, and significantly outperforms commercial closed-source tools in Chinese font fidelity and typographic controllability.
📝 Abstract
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.
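The abstract describes a three-stage T2D pipeline: a T2I background generator, an MLLM-based layout planner, and the DiT-based text model that renders transparent RGBA text foregrounds conditioned on the background, prompt, and layout. The snippet below is a minimal illustrative sketch of how such stages could be wired together; all function names, signatures, and the stubbed behavior are hypothetical placeholders, not the actual UTDesign API (only the Pillow compositing calls are real).

```python
from dataclasses import dataclass
from typing import List, Tuple
from PIL import Image  # Pillow, used here only for RGBA alpha compositing


@dataclass
class TextBox:
    content: str               # text string to render
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) placement on the canvas


def plan_layout(prompt: str, canvas_size: Tuple[int, int]) -> List[TextBox]:
    """MLLM-based layout planner (hypothetical): maps a design prompt to text
    boxes. Stubbed with a fixed single-box layout for illustration."""
    w, h = canvas_size
    return [TextBox(content="UTDesign Demo", bbox=(w // 8, h // 8, 3 * w // 4, h // 6))]


def generate_background(prompt: str, canvas_size: Tuple[int, int]) -> Image.Image:
    """Pre-trained T2I model (hypothetical): returns the background design image.
    Stubbed with a solid-colour canvas."""
    return Image.new("RGBA", canvas_size, (240, 235, 220, 255))


def generate_text_foreground(box: TextBox, background: Image.Image,
                             prompt: str) -> Image.Image:
    """DiT-based text generator with the multi-modal condition encoder
    (hypothetical): renders a transparent RGBA glyph layer for one text box,
    conditioned on background, prompt, and layout. Stubbed with an empty
    transparent layer."""
    return Image.new("RGBA", background.size, (0, 0, 0, 0))


def text_to_design(prompt: str, canvas_size=(1024, 768)) -> Image.Image:
    """Fully automated T2D flow: background -> layout -> per-box RGBA text
    foregrounds -> alpha compositing onto the background."""
    canvas = generate_background(prompt, canvas_size)
    for box in plan_layout(prompt, canvas_size):
        foreground = generate_text_foreground(box, canvas, prompt)
        canvas = Image.alpha_composite(canvas, foreground)  # overlay transparent text layer
    return canvas


if __name__ == "__main__":
    design = text_to_design("a minimalist poster for a spring tea festival")
    design.save("design.png")
```

In this sketch the text foreground is generated per layout box and composited as a separate RGBA layer, which mirrors the abstract's description of background-aware, style-consistent text synthesis on top of an independently generated background.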