STRICT: Stress Test of Rendering Images Containing Text

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models generate photorealistic images but struggle significantly with synthesizing coherent, legible, and instruction-aligned text within images. To address this, we introduce the first systematic stress-testing benchmark targeting three critical dimensions: maximum renderable text length, OCR recognition accuracy, and instruction-following fidelity. We propose a multidimensional quantitative evaluation framework, featuring controllable prompt engineering, structured instruction sampling, OCR-based verification, and automated metric computation—forming an end-to-end assessment pipeline. Extensive experiments across multiple state-of-the-art diffusion models reveal sharp degradation in text legibility beyond eight characters and a 42% instruction violation rate; open-source models consistently underperform closed-source counterparts. This work is the first to expose fundamental limitations in long-range cross-modal text consistency and instruction adherence in diffusion-based image generation, thereby advancing diagnostic evaluation paradigms for multimodal text rendering.
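The OCR-based verification and automated metric computation described above can be sketched in a few lines. This is a minimal illustration only: the function names, the edit-distance-based accuracy score, and the result-tuple format are assumptions for exposition, not the benchmark's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ocr_accuracy(target: str, ocr_output: str) -> float:
    # Normalized character accuracy: 1 - edit_distance / target length,
    # clipped at 0 so very wrong outputs do not go negative.
    if not target:
        return 1.0
    return max(0.0, 1 - levenshtein(target, ocr_output) / len(target))

def violation_rate(results) -> float:
    # results: list of (target_text, ocr_text, followed_instruction) tuples;
    # the rate of samples where the generation instruction was violated.
    if not results:
        return 0.0
    return sum(1 for _, _, ok in results if not ok) / len(results)
```

In this sketch, sweeping `target` over prompts of increasing length and watching `ocr_accuracy` fall off would correspond to measuring the maximum renderable text length, while `violation_rate` aggregates instruction-following failures across a sample set.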

📝 Abstract
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which instructions for generating text are violated. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
Problem

Research questions and friction points this paper is trying to address.

Diffusion models struggle to generate consistent, legible text in images
Existing methods fail to model long-range spatial dependencies effectively
No systematic benchmark tests text rendering capabilities in diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for testing text rendering in diffusion models
Evaluates text length, correctness, and instruction adherence
Identifies limitations in long-range consistency and instruction-following
Authors
Tianyu Zhang
DIRO, Université de Montréal, Mila - Quebec AI Institute
Xinyu Wang
McGill University
Zhenghan Tai
University of Toronto
Information Retrieval · Large Language Model · Retrieval Augmented Generation
Lu Li
University of Pennsylvania
Jijun Chi
University of Toronto
Jingrui Tian
University of California, Los Angeles
Hailin He
Unknown affiliation
Suyuchen Wang
Université de Montréal / Mila
NLP · LLM · VLM · Deep Learning