🤖 AI Summary
Existing diffusion models (e.g., TextDiffuser-2) suffer from low layout efficiency, poor hardware adaptability, and high computational resource consumption in text-embedded image generation. To address these issues, this work proposes a two-stage lightweight framework: (1) a reinforcement learning (PPO)-based text bounding-box layout optimization stage that guarantees zero overlap and enables real-time generation on both CPU and GPU; and (2) a high-fidelity image synthesis stage built upon the TextDiffuser-2 architecture. To our knowledge, this is the first work to introduce reinforcement learning for text layout optimization. The resulting model weighs only 2 MB and supports edge-device deployment. On the MARIOEval benchmark, it achieves state-of-the-art OCR accuracy and CLIPScore while accelerating inference by 97.64%.
📝 Abstract
Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations on the MARIOEval benchmark demonstrate that our framework matches or surpasses TextDiffuser-2 in text placement and image synthesis quality, achieving OCR accuracy and CLIPScore close to state-of-the-art models while running 97.64% faster and requiring only 2 MB of memory.
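The abstract describes an RL agent that learns to place text bounding boxes with zero overlap. A minimal sketch of the kind of reward such a layout policy might maximize is shown below; the function names, box encoding `(x, y, w, h)`, and the pure overlap-penalty objective are illustrative assumptions, not the paper's actual reward design.

```python
# Hypothetical reward for an RL (e.g., PPO) text-layout agent: penalize
# pairwise overlap between text bounding boxes, so the optimal policy
# produces layouts with zero overlap. Boxes are (x, y, width, height);
# this encoding and the unweighted penalty are illustrative assumptions.

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return ix * iy

def layout_reward(boxes):
    """Negative total pairwise overlap; equals 0 only when no boxes overlap."""
    penalty = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            penalty += overlap_area(boxes[i], boxes[j])
    return -penalty
```

For example, two stacked, non-overlapping boxes yield a reward of 0 (the zero-overlap optimum), while intersecting boxes are penalized in proportion to their shared area. A practical reward would likely add terms for readability and canvas coverage, which are omitted here.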