TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models for text-embedded image generation (e.g., TextDiffuser-2) generate layouts slowly, adapt poorly across hardware, and consume substantial compute. To address these issues, this work proposes a two-stage lightweight framework: (1) a reinforcement learning (PPO)-based stage that optimizes text bounding-box layouts to reduce overlap and runs quickly on both CPU and GPU; and (2) a high-fidelity image synthesis stage built upon the TextDiffuser-2 architecture. According to the authors, this is to their knowledge the first work to introduce reinforcement learning for text layout optimization. The approach requires only 2 MB of memory, which makes edge-device deployment feasible. On the MARIOEval benchmark, it achieves OCR and CLIPScore results close to state-of-the-art models while running 97.64% faster.

📝 Abstract
Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2's quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% faster and requiring only 2 MB of memory to run.
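The abstract does not say how the "97.64% faster" figure is defined. As a rough illustration only, the sketch below assumes it means a 97.64% reduction in runtime relative to the baseline and converts that into a speedup ratio. The function name `speedup_factor` and this interpretation are assumptions, not something stated by the authors.

```python
def speedup_factor(time_reduction: float) -> float:
    """Convert a fractional runtime reduction into a speedup ratio.

    Assumption: "X% faster" means the new runtime is (1 - X) times
    the baseline runtime. The source does not confirm this reading.
    """
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    # baseline_time / new_time = 1 / (1 - reduction)
    return 1.0 / (1.0 - time_reduction)


if __name__ == "__main__":
    # 0.9764 is the figure quoted in the abstract.
    print(round(speedup_factor(0.9764), 2))
```

Under that reading, the reported figure corresponds to a speedup of roughly 42x; under a different definition of "faster" the ratio would differ.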
Problem

Research questions and friction points this paper is trying to address.

Optimize text layout for efficient text-to-image synthesis
Reduce resource usage in text-embedded image generation
Improve runtime speed and flexibility on CPU/GPU platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes text layout
Two-stage pipeline enhances efficiency
Runs fast on both CPU and GPU
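To make the layout-optimization idea concrete, here is a minimal sketch of an overlap penalty that an RL agent placing text bounding boxes could receive as part of its reward. It is not the paper's reward function: the box format `(x_min, y_min, x_max, y_max)`, the canvas size, the `overlap_weight` parameter, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def total_overlap(boxes: List[Box]) -> float:
    """Sum of pairwise intersection areas over all box pairs."""
    return sum(intersection_area(a, b) for a, b in combinations(boxes, 2))


def layout_reward(
    boxes: List[Box],
    canvas: Tuple[float, float] = (128.0, 128.0),  # assumed canvas size
    overlap_weight: float = 1.0,                   # assumed penalty weight
) -> float:
    """Illustrative reward: 0 for overlap-free layouts, negative otherwise."""
    canvas_area = canvas[0] * canvas[1]
    return -overlap_weight * total_overlap(boxes) / canvas_area


if __name__ == "__main__":
    clean = [(0, 0, 40, 20), (50, 30, 100, 50)]
    clashing = [(0, 0, 10, 10), (5, 5, 15, 15)]
    print(layout_reward(clean), layout_reward(clashing))
```

A policy trained with an algorithm such as PPO would be pushed toward layouts where this term is zero. A practical reward would likely add further terms, for example keeping boxes inside the canvas, which are omitted here.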
Kazi Mahathir Rahman
BRAC University, Dhaka, Bangladesh
Showrin Rahman
Student of CS, BRAC University
Deep learning, Machine Learning, Computer vision
Sharmin Sultana Srishty
BRAC University, Dhaka, Bangladesh