Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models, while capable of recognizing textual content in images, generally lack sensitivity to typographic attributes such as font type, size, style, and color. This work presents the first systematic evaluation of 15 state-of-the-art models across four scripts, 26 fonts, and three difficulty levels, revealing that this performance gap stems primarily from insufficient typographic diversity in training data rather than model capacity limitations. To address this, we introduce a synthetic-data-based benchmark for typography-aware evaluation and apply LoRA-based fine-tuning to enhance open-source models. Experiments demonstrate that the fine-tuned models surpass the best closed-source system in font size recognition and achieve substantially improved overall typographic perception, though font style recognition remains challenging.
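
The summary describes a synthetic-data benchmark spanning fonts, sizes, styles, and colors. Below is a minimal sketch of how such typography-labeled samples could be rendered with Pillow; the font directory, size buckets, color palette, and label schema are illustrative assumptions, not the paper's released pipeline.

```python
# Hypothetical synthetic typography-sample generator (not the authors' code).
import json
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont  # pip install pillow

FONT_DIR = Path("fonts")          # assumed directory of .ttf files, one per family
SIZES = [12, 18, 24, 36, 48]      # assumed font-size buckets
COLORS = {"black": (0, 0, 0), "red": (220, 30, 30), "blue": (30, 60, 220)}

def render_sample(text: str, out_path: Path) -> dict:
    """Render `text` with a random font/size/color and return its labels."""
    font_path = random.choice(list(FONT_DIR.glob("*.ttf")))
    size = random.choice(SIZES)
    color_name, rgb = random.choice(list(COLORS.items()))

    font = ImageFont.truetype(str(font_path), size)
    img = Image.new("RGB", (512, 128), "white")
    ImageDraw.Draw(img).text((10, 40), text, font=font, fill=rgb)
    img.save(out_path)

    # Ground-truth labels from which typography-aware QA pairs can be built.
    return {"image": out_path.name, "font": font_path.stem,
            "size": size, "color": color_name, "text": text}

if __name__ == "__main__":
    Path("samples").mkdir(exist_ok=True)
    labels = [render_sample("The quick brown fox", Path(f"samples/{i}.png"))
              for i in range(100)]
    Path("samples/labels.json").write_text(json.dumps(labels, indent=2))
```

Because every attribute is set programmatically at render time, the labels are exact by construction, which is what makes synthetic data attractive for probing typographic perception.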

📝 Abstract
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
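
The abstract reports that LoRA fine-tuning on a small synthetic set substantially improves an open-source model. A hedged sketch of that setup using Hugging Face PEFT follows; the base model name, target modules, and hyperparameters are assumptions standing in for the paper's released recipe.

```python
# Illustrative LoRA fine-tuning setup (assumed hyperparameters, not the paper's recipe).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Assumed open-source VLM; substitute whichever model the recipe targets.
model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension (assumed)
    lora_alpha=32,                        # adapter scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common attention-projection targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train
```

Training only low-rank adapters keeps the update cheap enough that a few thousand synthetic samples can shift typographic perception without disturbing the backbone's general capabilities, which is consistent with the abstract's framing of the gap as a data omission rather than a capacity ceiling.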
Problem

Research questions and friction points this paper is trying to address.

typography gap
vision-language models
font recognition
visual perception
text appearance
Innovation

Methods, ideas, or system contributions that make the work stand out.

typography perception
vision-language models
font recognition
LoRA fine-tuning
relational visual reasoning