UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models suffer from text style distortion and low recognition accuracy, particularly for small-font and multilingual (especially Chinese) text rendering. To address this, we propose a high-fidelity text generation method tailored for AI-assisted graphic design. Methodologically: (i) we introduce the first end-to-end text style transfer model built upon the DiT architecture, which generates transparent RGBA text foregrounds; (ii) we construct the first bilingual (Chinese–English) synthetic text-image dataset for this task; and (iii) we build a fully automated, background-aware Text-to-Design (T2D) pipeline that couples a multimodal condition encoder with a pre-trained text-to-image (T2I) model and an MLLM-based layout planner. Experiments demonstrate that our approach achieves state-of-the-art performance among open-source methods in both text accuracy and style consistency, and offers distinct advantages over commercial closed-source tools in Chinese font fidelity and typographic controllability.

📝 Abstract
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.
Problem

Research questions and friction points this paper is trying to address.

Improving text rendering in design images, particularly for small typography and non-Latin scripts.
Enabling high-precision stylized text editing and conditional text generation in graphic design.
Building a fully automated text-to-design pipeline that preserves style consistency and text accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for stylized text editing and generation
DiT-based text style transfer with transparent RGBA text foregrounds
Conditional text synthesis using multi-modal encoder and layout planner
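The pipeline structure described above can be sketched in miniature. This is a hypothetical Python sketch with illustrative names, not the authors' actual API: a stub stands in for the MLLM-based layout planner, and the transparent RGBA text foreground is merged with the background via standard "over" alpha compositing, which is the operation such an RGBA foreground generator implies.

```python
from dataclasses import dataclass

# Illustrative sketch only; class and function names are assumptions,
# not the UTDesign codebase's real interfaces.

@dataclass
class TextBox:
    """A planned text element: content plus top-left placement."""
    text: str
    x: int
    y: int

def plan_layout(prompt: str) -> list[TextBox]:
    # Stand-in for the MLLM-based layout planner, which would map a
    # design prompt to text contents and box placements.
    return [TextBox(text=prompt, x=0, y=0)]

def alpha_composite(fg_rgba: tuple, bg_rgb: tuple) -> tuple:
    """Blend one RGBA foreground pixel over an RGB background pixel
    using the standard 'over' rule: out = a*fg + (1 - a)*bg."""
    r, g, b, a = fg_rgba
    alpha = a / 255.0
    return tuple(round(alpha * f + (1.0 - alpha) * bgc)
                 for f, bgc in zip((r, g, b), bg_rgb))
```

For example, a half-transparent red foreground pixel over a blue background blends to roughly equal parts red and blue, which is how a generated RGBA text layer would sit on a T2I-generated poster background without overwriting it.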
Yiming Zhao
Wangxuan Institute of Computer Technology, Peking University, China
Yuanpeng Gao
Wangxuan Institute of Computer Technology, Peking University, China
Yuxuan Luo
City University of Hong Kong
Jiwei Duan
Kingsoft Office, China
Shisong Lin
Kingsoft Office, China
Longfei Xiong
Kingsoft Office, China
Zhouhui Lian
Peking University