🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in spatial and positional reasoning, yet these limitations remain poorly characterized—particularly for structured textual visualizations like ASCII art, which expose bottlenecks in multimodal representation learning. Method: We introduce ASCIIBench, the first benchmark dedicated to evaluating LLMs’ comprehension of ASCII art, comprising 5,315 class-annotated ASCII images and supporting both generation and classification tasks. We curate a high-quality ASCII dataset, fine-tune CLIP to accommodate symbolic visual modalities, and analyze embedding separability via cosine similarity. Contribution/Results: Empirical analysis reveals that standard CLIP embeddings achieve near-random performance across most categories, with discriminability observed only in highly cohesive classes—indicating a fundamental representational bottleneck rather than a generative one. This work uncovers a critical limitation of current cross-modal models in structuring symbolic visual representations and establishes ASCIIBench as a foundational benchmark and empirical basis for developing novel metrics and modeling paradigms for symbolic vision.
📝 Abstract
Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium in which characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generative variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at https://github.com/ASCIIBench/ASCIIBench.
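The separability analysis described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes embeddings have already been produced (e.g. by the fine-tuned CLIP model) and simply compares, per class, the mean within-class cosine similarity against the mean cross-class cosine similarity. The `class_separability` function and the toy vectors are hypothetical names used here for illustration.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def class_separability(embeddings, labels):
    """For each class, return (mean within-class similarity,
    mean cross-class similarity). A discriminable class should
    show within >> cross; chance-level classes show within ≈ cross."""
    report = {}
    for c in set(labels):
        ins = [e for e, l in zip(embeddings, labels) if l == c]
        outs = [e for e, l in zip(embeddings, labels) if l != c]
        # All unordered within-class pairs, excluding self-pairs.
        within = [cosine(x, y)
                  for i, x in enumerate(ins)
                  for j, y in enumerate(ins) if i < j]
        cross = [cosine(x, y) for x in ins for y in outs]
        report[c] = (sum(within) / len(within), sum(cross) / len(cross))
    return report

# Toy example: two tight clusters standing in for cohesive ASCII classes.
emb = [[1.0, 0.0, 0.1], [1.0, 0.05, 0.0],
       [0.0, 1.0, 0.05], [0.05, 1.0, 0.0]]
labels = ["cat", "cat", "dog", "dog"]
print(class_separability(emb, labels))
```

Under this protocol, the paper's finding corresponds to most ASCII classes showing within-class similarity close to their cross-class similarity, with a clear gap appearing only for highly cohesive classes.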