🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in spatial and positional reasoning, yet these limitations remain poorly characterized—particularly for structured textual visualizations like ASCII art, which expose bottlenecks in multimodal representation learning. Method: We introduce ASCIIBench, the first benchmark dedicated to evaluating LLMs’ comprehension of ASCII art, comprising 5,315 class-annotated ASCII images and supporting both generation and classification tasks. We curate a high-quality ASCII dataset, fine-tune CLIP to accommodate symbolic visual modalities, and analyze embedding separability via cosine similarity. Contribution/Results: Empirical analysis reveals that standard CLIP embeddings achieve near-random performance across most categories, with discriminability observed only in highly cohesive classes—indicating a fundamental representational bottleneck rather than a generative one. This work uncovers a critical limitation of current cross-modal models in structuring symbolic visual representations and establishes ASCIIBench as a foundational benchmark and empirical basis for developing novel metrics and modeling paradigms for symbolic vision.
📝 Abstract
Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium in which characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generative variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at https://github.com/ASCIIBench/ASCIIBench.
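The separability analysis described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes embeddings have already been produced (e.g. by the fine-tuned CLIP model) and simply compares, per class, the mean within-class cosine similarity against the mean cross-class cosine similarity. The `class_separability` function and the toy vectors are hypothetical names used here for illustration.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def class_separability(embeddings, labels):
    """For each class, return (mean within-class similarity,
    mean cross-class similarity). A discriminable class should
    show within >> cross; chance-level classes show within ≈ cross."""
    report = {}
    for c in set(labels):
        ins = [e for e, l in zip(embeddings, labels) if l == c]
        outs = [e for e, l in zip(embeddings, labels) if l != c]
        # All unordered within-class pairs, excluding self-pairs.
        within = [cosine(x, y)
                  for i, x in enumerate(ins)
                  for j, y in enumerate(ins) if i < j]
        cross = [cosine(x, y) for x in ins for y in outs]
        report[c] = (sum(within) / len(within), sum(cross) / len(cross))
    return report

# Toy example: two tight clusters standing in for cohesive ASCII classes.
emb = [[1.0, 0.0, 0.1], [1.0, 0.05, 0.0],
       [0.0, 1.0, 0.05], [0.05, 1.0, 0.0]]
labels = ["cat", "cat", "dog", "dog"]
print(class_separability(emb, labels))
```

Under this protocol, the paper's finding corresponds to most ASCII classes showing within-class similarity close to their cross-class similarity, with a clear gap appearing only for highly cohesive classes.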