Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether state-of-the-art vision-language models (VLMs) exhibit human-like robustness in visual word recognition under “visible but unreadable” distortions—such as character splitting, fusion, and partial occlusion—that challenge symbolic segmentation, composition, and binding. Method: We introduce the first cross-script, psychophysics-inspired benchmark, generating controlled adversarial text stimuli via glyph segmentation, recombination, and superposition; it incorporates multilingual evaluation, structured prompting, and a rigorous quantitative assessment protocol. Results: Experiments reveal a precipitous performance drop for current VLMs on distorted text, with highly incoherent outputs—exposing their lack of structural priors for symbolic composition and binding, and overreliance on low-level visual invariances. This work provides the first systematic characterization of the cognitive gap in VLMs’ text readability, establishing a novel benchmark and theoretical foundation for modeling symbolic segmentation, composition, and binding mechanisms.

📝 Abstract
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on the compositional priors needed for robust literacy. We release stimuli-generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with fragmented and occluded text recognition
Models lack compositional priors for robust literacy across scripts
Performance drops on perturbed glyphs despite human legibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stimuli generation via glyph splicing and overlaying
Benchmarking across Chinese logographs and English alphabetic systems
Analyzing model reliance on visual invariances versus compositional priors
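The three distortion families above (splitting, fusion/superposition, partial occlusion) can be sketched as operations on binary pixel grids. This is a minimal illustration, not the paper's released code: the toy 3x3 font, the function names, and all parameters are hypothetical, and real stimuli would be rendered from actual font glyphs.

```python
# Toy binary glyphs (1 = ink, 0 = background); a stand-in for a real font.
TOY_FONT = {
    "A": [[0, 1, 0],
          [1, 1, 1],
          [1, 0, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

def split_glyph(glyph, shift=1):
    """Character splitting: cut the glyph in half horizontally and
    shift the bottom half sideways, padding with background pixels."""
    h = len(glyph)
    top = [row + [0] * shift for row in glyph[:h // 2]]
    bottom = [[0] * shift + row for row in glyph[h // 2:]]
    return top + bottom

def fuse_glyphs(a, b):
    """Glyph fusion / superposition: pixelwise OR of two same-size glyphs,
    so both characters' strokes overlap in one image."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def occlude(glyph, rows):
    """Partial occlusion: blank out the given row indices."""
    return [[0] * len(row) if i in rows else list(row)
            for i, row in enumerate(glyph)]
```

The same grid operations apply unchanged to logographic and alphabetic glyphs, which is what makes a cross-script benchmark of this kind controllable: the distortion is defined on pixels, not on any script-specific structure.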
Authors

Jie Zhang (CFAR and IPHC, A*STAR)
Ting Xu (National University of Singapore)
Gelei Deng (Nanyang Technological University)
Runyi Hu (Nanyang Technological University)
Han Qiu (NTU)
Tianwei Zhang (Nanyang Technological University)
Qing Guo (Nankai University)
Ivor Tsang (CFAR and IPHC, A*STAR)