Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing interpretability methods for analyzing failures in large language models (LLMs) are mostly evaluated on short or idealized inputs, which limits insight into failure mechanisms on real-world benchmark tasks. This work formalizes failure analysis as contrastive attribution: it takes the logit difference between an erroneous output token and a correct alternative, then attributes that difference to input tokens and internal model states using Layer-wise Relevance Propagation (LRP). An efficient extension builds cross-layer attribution graphs, enabling fine-grained analysis over long contexts. The authors then systematically evaluate LRP’s applicability and limitations across datasets, model sizes, and training checkpoints. Experiments show that the method yields informative diagnostic signals in some failure cases but is not universally applicable, which marks out the practical boundaries of current attribution techniques for LLM failure analysis.
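
As a compact restatement of the quantity being attributed, the contrastive target can be written as follows. The notation is an assumption made here for illustration and is not taken from the paper: z denotes the logits at the failing output position, and R_i denotes the relevance assigned to input token i.

```latex
\Delta = z_{\text{wrong}} - z_{\text{right}},
\qquad
\sum_{i} R_i \approx \Delta
```

The second relation reflects the conservation property that LRP-style rules aim for: the per-token relevances should sum approximately to the logit difference being explained, so a large positive R_i marks a token that pushes the model toward the wrong output.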

📝 Abstract

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
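
To make the idea of attributing a logit difference to input tokens concrete, here is a minimal sketch on a toy model. It is not the paper's implementation: the model is a hypothetical bias-free linear one (the hidden state is the sum of token embeddings, and logits are a linear readout), and every name and size below is an assumption chosen for illustration. In this linear setting the per-token contribution to the logit difference can be computed exactly, which is also what the basic LRP rule reduces to on bias-free linear layers.

```python
import numpy as np


def contrastive_attribution(token_embeddings, unembed, wrong_id, right_id):
    """Attribute logit[wrong_id] - logit[right_id] to each input token.

    Toy model (an assumption, not the paper's architecture):
        hidden = sum of token embeddings
        logits = unembed @ hidden
    """
    # Direction in hidden space separating the wrong token from the right one.
    contrast_direction = unembed[wrong_id] - unembed[right_id]
    # Each token's relevance is its embedding projected onto that direction.
    return token_embeddings @ contrast_direction


# Hypothetical sizes: 5 input tokens, hidden size 8, vocabulary of 10.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 8))
unembed = rng.normal(size=(10, 8))
wrong_id, right_id = 3, 7

relevance = contrastive_attribution(token_embeddings, unembed, wrong_id, right_id)

# Forward pass of the toy model, to check that relevance is conserved.
logits = unembed @ token_embeddings.sum(axis=0)
logit_diff = logits[wrong_id] - logits[right_id]

print("per-token relevance:", relevance)
print("sum of relevance:", relevance.sum())
print("logit difference:", logit_diff)
```

In this sketch the relevances sum to the logit difference, so tokens with large positive relevance are the ones pushing the toy model toward the wrong output. Real transformers contain nonlinearities, attention, and normalization layers, where LRP needs dedicated propagation rules and conservation holds only approximately.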

Problem

Research questions and friction points this paper is trying to address.

interpretability

Large Language Models

failure analysis

realistic benchmarks

contrastive attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive attribution

LRP-based attribution

LLM interpretability