ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in spatial and positional reasoning, yet these limitations remain poorly characterized—particularly regarding structured textual visualizations like ASCII art, which expose bottlenecks in multimodal representation learning. Method: We introduce ASCIIBench, the first benchmark dedicated to evaluating LLMs’ comprehension of ASCII art, comprising 5,315 annotated ASCII images spanning generation and classification tasks. We curate a high-quality ASCII dataset, fine-tune CLIP to accommodate symbolic visual modalities, and analyze embedding separability via cosine similarity. Contribution/Results: Empirical analysis reveals that standard CLIP embeddings achieve near-random performance across most categories, with discriminability observed only in highly cohesive classes—indicating a fundamental representational bottleneck, not a generative one. This work uncovers a critical limitation of current cross-modal models in structuring symbolic visual representations and establishes ASCIIBench as a foundational benchmark and empirical basis for developing novel metrics and modeling paradigms for symbolic vision.

📝 Abstract
Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium where characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generation variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at https://github.com/ASCIIBench/ASCIIBench.
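The evaluation described above scores embedding separability with cosine similarity over CLIP embeddings. A minimal sketch of that scoring step, with toy vectors standing in for real CLIP outputs; `nearest_class` and the class-prototype setup are illustrative assumptions, not the authors' released code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_class(embedding: np.ndarray, class_prototypes: dict) -> str:
    """Assign an ASCII-image embedding to its most similar class prototype.

    In the benchmark setting, `embedding` would come from the fine-tuned
    CLIP image encoder and each prototype from averaging a class's
    embeddings; here they are hypothetical 3-d vectors.
    """
    return max(class_prototypes,
               key=lambda c: cosine_similarity(embedding, class_prototypes[c]))

# Toy prototypes standing in for per-class mean CLIP embeddings.
protos = {"cat": np.array([1.0, 0.0, 0.0]),
          "tree": np.array([0.0, 1.0, 0.0])}
query = np.array([0.9, 0.1, 0.0])
print(nearest_class(query, protos))  # prints "cat"
```

When per-class mean similarity is high (a cohesive class), this nearest-prototype rule discriminates well; the paper's finding is that for most ASCII categories the embeddings are not separable enough for it to beat chance.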
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' spatial reasoning via ASCII art
Assessing ASCII image generation and classification capabilities
Identifying representation bottlenecks in multimodal symbolic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ASCIIBench benchmark for ASCII art evaluation
Fine-tuned CLIP model for ASCII structure analysis
Reveals representation bottleneck in multimodal embeddings
Authors
Kerry Luo (Algoverse AI Research)
Michael Fu (The University of Melbourne)
Joshua Peguero (Algoverse AI Research)
Husnain Malik (Algoverse AI Research)
Anvay Patil (Algoverse AI Research)
Joyce Lin (Algoverse AI Research)
Megan Van Overborg (Algoverse AI Research)
Ryan Sarmiento (Algoverse AI Research)
Kevin Zhu (PhD, Stanford University; Professor of Business+Technology, University of California, San Diego)