🤖 AI Summary
This work addresses the degradation in visual text rendering (VTR) quality caused by structural anomalies—such as distortion, misalignment, and blurriness—which existing multimodal large language models (MLLMs) and OCR systems struggle to perceive, thereby hindering effective reinforcement learning optimization. To overcome this limitation, we propose TextPecker, a plug-and-play, structure-aware reinforcement learning framework that, for the first time, enables fine-grained, character-level quantification of structural anomalies. We further introduce a dedicated annotated dataset and a stroke-editing synthesis engine to generate reliable reward signals for VTR training. Experimental results demonstrate that TextPecker significantly enhances text fidelity in general-purpose text-to-image models, achieving an average improvement of 4% in structural fidelity and 8.7% in Chinese semantic alignment on models such as Qwen-Image, establishing a new state of the art in VTR.
📝 Abstract
Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play, structural-anomaly-perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand coverage of structural errors. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state of the art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step toward reliable and structurally faithful visual text generation.