ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language models predominantly label outputs as simply correct or incorrect, offering little insight into the root causes of errors—such as formatting mistakes, computational inaccuracies, or misinterpretations of the prompt—and thereby hindering effective model improvement. This work proposes ErrorMap, a method that leverages automated attribution analysis to construct model-specific "failure signatures." By analyzing 35 datasets and 83 models, the authors present the first systematic, cross-task, cross-model error taxonomy, culminating in ErrorAtlas, an extensible atlas of failure modes. The approach not only uncovers previously overlooked high-frequency error types but also shifts the evaluation paradigm from "whether the output is correct" to "why the error occurred," providing actionable insights for model debugging, benchmark alignment, and model selection. The code and classification framework are publicly released.

📝 Abstract
Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors that reveals recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omission of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation that exposes hidden weaknesses and directs progress. Unlike success, which is typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available, with plans to periodically update ErrorAtlas as new benchmarks and models emerge.
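To make the "failure signature" idea concrete, here is a minimal sketch of how per-example error labels might be aggregated into a model-level frequency vector. The category names and the `failure_signature` helper are illustrative assumptions, not the paper's released taxonomy or code; the real ErrorAtlas labels come from the authors' automated attribution analysis.

```python
from collections import Counter

# Hypothetical error categories, loosely echoing the abstract
# (formatting, calculation, misinterpretation, omitted details, noise).
# The actual ErrorAtlas taxonomy defines the real label set.
TAXONOMY = [
    "formatting_error",
    "calculation_error",
    "question_misinterpretation",
    "omitted_required_detail",
    "dataset_noise",
]

def failure_signature(error_labels):
    """Turn a list of per-example error labels into a normalized
    frequency vector over the taxonomy: the model's failure signature."""
    counts = Counter(error_labels)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {cat: counts.get(cat, 0) / total for cat in TAXONOMY}

# Toy example: labels an automated attribution step might emit
labels = [
    "formatting_error",
    "calculation_error",
    "calculation_error",
    "question_misinterpretation",
]
sig = failure_signature(labels)
```

Comparing such vectors across models (or across benchmarks for one model) is what lets an atlas-style analysis surface recurring failure patterns rather than a single accuracy number.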
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
failure analysis
benchmark limitations
error taxonomy
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ErrorMap
ErrorAtlas
failure attribution
LLM evaluation
error taxonomy