Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

📅 2025-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the true capability of vision-language models (VLMs) on fine-grained font recognition, a task largely overlooked in prior research. To address the absence of standardized benchmarks, we introduce FRB, the first dedicated font recognition benchmark, comprising both "easy" and "hard" subsets; the latter incorporates Stroop-effect stimuli to decouple glyph texture from lexical semantics. We conduct systematic zero-shot and few-shot evaluations across multiple VLMs, augment with chain-of-thought prompting, and perform cross-model attention visualization. Results reveal that state-of-the-art VLMs achieve <30% accuracy on FRB, with few-shot learning and CoT prompting yielding negligible improvement. Attention maps consistently fail to localize glyph-level texture features, indicating strong reliance on textual semantics rather than visual typography. This study is the first to systematically expose the fundamental failure of VLMs in font recognition, proposes a cognitive-psychology-inspired paradigm for hard-sample construction, and establishes a novel methodological framework for evaluating fine-grained visual understanding.

๐Ÿ“ Abstract
Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a Stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
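The hard version's Stroop construction (font names rendered as the text samples) can be sketched as a simple enumeration of (text, rendering font) pairs. This is a minimal illustration, not the paper's actual pipeline: the font list below is a hypothetical placeholder (the benchmark's exact 15 fonts are not listed here), and the all-pairs scheme is an assumption about how name/font mismatches could be generated.

```python
from itertools import product

# Hypothetical stand-in for the benchmark's 15 fonts; the actual FRB font
# set is not specified in this summary.
FONTS = [
    "Arial", "Times New Roman", "Courier New", "Georgia", "Verdana",
    "Helvetica", "Garamond", "Futura", "Gill Sans", "Baskerville",
    "Palatino", "Rockwell", "Franklin Gothic", "Didot", "Optima",
]

def hard_set_pairs(fonts):
    """Enumerate Stroop-style (text, font) pairs for a hard subset.

    Each sample's text is itself a font name. When the text names the
    rendering font the pair is congruent; otherwise it is incongruent,
    so a model that reads the word's meaning instead of its glyph
    texture is pulled toward the wrong answer.
    """
    return [
        {"text": text, "font": font, "congruent": text == font}
        for text, font in product(fonts, fonts)
    ]

pairs = hard_set_pairs(FONTS)
incongruent = [p for p in pairs if not p["congruent"]]
```

Rendering each pair to an image would then require a rasterizer (e.g. Pillow with the corresponding font files), which is omitted here since the exact rendering settings are not described in the summary.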
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' capability in fine-grained font recognition
Evaluating VLMs' performance on structured Font Recognition Benchmark
Investigating limitations of VLMs in capturing semantic font features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Font Recognition Benchmark (FRB) dataset
Evaluates VLMs with easy and hard font versions
Uses attention analysis to reveal VLM limitations