ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language models predominantly label outputs as simply correct or incorrect, offering little insight into the root causes of errors—such as formatting mistakes, computational inaccuracies, or misinterpretations of the prompt—and thereby hindering effective model improvement. This work proposes ErrorMap, a method that leverages automated attribution analysis to construct model-specific "failure signatures." By analyzing 35 datasets and 83 models, the authors present the first systematic, cross-task, cross-model error taxonomy, culminating in ErrorAtlas, an extensible atlas of failure modes. The approach not only uncovers previously overlooked high-frequency error types but also shifts the evaluation paradigm from "whether the output is correct" to "why the error occurred," providing actionable insights for model debugging, benchmark alignment, and model selection. The code and classification framework are publicly released.

📝 Abstract
Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors that reveals recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omission of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation that exposes hidden weaknesses and directs progress. Unlike success, which is typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available, with plans to periodically update ErrorAtlas as new benchmarks and models emerge.
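To make the "failure signature" idea concrete, here is a minimal sketch of how per-example error labels might be aggregated into a model-level frequency vector. The category names and the `failure_signature` helper are illustrative assumptions, not the paper's released taxonomy or code; the real ErrorAtlas labels come from the authors' automated attribution analysis.

```python
from collections import Counter

# Hypothetical error categories, loosely echoing the abstract
# (formatting, calculation, misinterpretation, omitted details, noise).
# The actual ErrorAtlas taxonomy defines the real label set.
TAXONOMY = [
    "formatting_error",
    "calculation_error",
    "question_misinterpretation",
    "omitted_required_detail",
    "dataset_noise",
]

def failure_signature(error_labels):
    """Turn a list of per-example error labels into a normalized
    frequency vector over the taxonomy: the model's failure signature."""
    counts = Counter(error_labels)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {cat: counts.get(cat, 0) / total for cat in TAXONOMY}

# Toy example: labels an automated attribution step might emit
labels = [
    "formatting_error",
    "calculation_error",
    "calculation_error",
    "question_misinterpretation",
]
sig = failure_signature(labels)
```

Comparing such vectors across models (or across benchmarks for one model) is what lets an atlas-style analysis surface recurring failure patterns rather than a single accuracy number.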
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
failure analysis
benchmark limitations
error taxonomy
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ErrorMap
ErrorAtlas
failure attribution
LLM evaluation
error taxonomy