🤖 AI Summary
Current generative models exhibit insufficient OCR capabilities in text-to-image generation and editing, particularly in producing legible, accurate, and layout-preserving textual content. Method: We propose the first systematic evaluation paradigm for “OCR-aware generation,” comprising 33 tasks across five real-world domains—document, handwritten, scene, artistic, and complex-layout images—and advocate photorealistic text generation as a foundational capability for general-purpose multimodal models. We introduce a customized input-prompt co-design mechanism and a multidimensional evaluation protocol integrating OCR accuracy metrics (e.g., CER, WER) with visual fidelity metrics (e.g., CLIP-Score, FID). Contribution/Results: Extensive benchmarking across six leading open- and closed-source models reveals critical deficiencies in character-level accuracy and structural layout preservation. Our analysis identifies language-vision alignment as the primary bottleneck limiting OCR generation performance, establishing a reproducible benchmark and concrete optimization directions for next-generation multimodal foundation models.
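The summary cites CER and WER as the OCR-accuracy metrics; the paper's own implementation is not shown here, but both are standard normalized edit distances. A minimal sketch (hypothetical helper names, not the benchmark's code):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # minimum of deletion, insertion, substitution
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits over reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```

For example, `cer("hello", "hallo")` is 0.2 (one substitution over five characters). Production evaluations typically use libraries such as `jiwer` rather than hand-rolled distances.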
📝 Abstract
Text images are a unique and crucial information medium, integrating visual aesthetics and linguistic semantics in the modern e-society. Due to their subtlety and complexity, text image generation remains a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (*e.g.*, the Flux series) and unified generative models (*e.g.*, GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models spanning both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models on OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills in general-domain generative models, rather than delegated to specialized solutions, and we hope this empirical analysis provides valuable insights for the community toward this goal. This evaluation is online and will be continuously updated at our GitHub repository.