LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To bridge the gap between prompt expressiveness and text rendering fidelity in text-to-image generation, this paper introduces LeX-Art, a full-stack synthesis suite. Methodologically, it constructs LeX-10K, a high-fidelity dataset of 10K aesthetically refined text-image pairs; designs LeX-Enhancer, a prompt enrichment model; and trains two text-to-image models, LeX-FLUX and LeX-Lumina. Contributions include PNED (Pairwise Normalized Edit Distance), a novel metric for robust text accuracy evaluation, and LeX-Bench, a comprehensive benchmark assessing fidelity, aesthetics, and alignment. Experiments show LeX-Lumina achieves a 79.81% PNED gain on CreateBench, while LeX-FLUX outperforms baselines by 3.18%, 4.45%, and 3.81% in color, positional, and font accuracy, respectively. All code, models, and data are publicly released.

📝 Abstract
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024×1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
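The paper's exact PNED formulation is not reproduced on this page; as a rough illustration of the underlying idea, a minimal sketch of a length-normalized edit distance between rendered and target text is shown below. The function names and the normalization by the longer string are assumptions for illustration, not the paper's definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def normalized_edit_distance(pred: str, target: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length.

    Hypothetical normalization; the paper's PNED may differ in detail.
    """
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))
```

A score of 0 means the rendered text matches the prompt's target text exactly; values near 1 indicate the rendered text is almost entirely wrong.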
Problem

Research questions and friction points this paper is trying to address.

Bridging gap between prompt expressiveness and text rendering fidelity
Creating high-quality text-image synthesis dataset and models
Developing benchmark for systematic text generation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality data synthesis pipeline using Deepseek-R1
Prompt enrichment model LeX-Enhancer for improved expressiveness
State-of-the-art text-to-image models LeX-FLUX and LeX-Lumina
Authors

Shitian Zhao, Shanghai AI Laboratory (LLM, MLLM, Generative Model)
Qilong Wu, Shanghai AI Laboratory
Xinyue Li, Shanghai AI Laboratory
Bo Zhang, Shanghai AI Laboratory
Ming Li, Shanghai AI Laboratory
Qi Qin, Shanghai AI Laboratory
Dongyang Liu, MMLab, CUHK (Image/Video Generation, LLMs, VLMs)
Kaipeng Zhang, Shanghai AI Laboratory (LLM, Multimodal LLMs, AIGC)
Hongsheng Li, The Chinese University of Hong Kong
Yu Qiao, Shanghai AI Laboratory
Peng Gao, Shanghai AI Laboratory
Bin Fu, Shanghai AI Laboratory
Zhen Li, The Chinese University of Hong Kong