Explaining the Reasoning of Large Language Models Using Attribution Graphs

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) have opaque inference processes, which hinders their safe and trustworthy deployment. Existing contextual attribution methods model only direct associations between the prompt and output tokens, neglecting causal dependencies across tokens in the autoregressive generation chain and thus producing incomplete explanations. To address this, the authors propose CAGE, a causal attribution framework that explicitly models multi-hop dependencies between the prompt and individual generated tokens via an attribution graph. CAGE introduces path marginalization to quantify the cumulative influence of intermediate token generations, and enforces row-stochastic constraints on the attribution matrix to keep attributions consistent. Evaluated across multiple LLMs, datasets, and metrics, CAGE improves attribution faithfulness by up to 40% on average over baselines, addressing the limitations of prior methods in modeling dynamic generative behavior.

📝 Abstract
Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision-making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties: causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
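The abstract's path-marginalization idea can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes a toy attribution matrix `A` in which each row gives one generated token's attribution over its causal predecessors (prompt tokens plus earlier generations), rows sum to 1 (row stochasticity), and later positions are zeroed (causality). Marginalizing over paths then folds each generation's credit for earlier generations back onto the prompt, so all credit ultimately lands on prompt tokens. All numbers and variable names here are hypothetical.

```python
import numpy as np

# Toy attribution graph: 3 generated tokens, 2 prompt tokens.
# Row i = attribution of generation g_{i+1} over [p1, p2, g1, g2],
# restricted to causal predecessors, each row summing to 1.
n_prompt = 2
A = np.array([
    [0.7, 0.3, 0.0, 0.0],   # g1 attends only to the prompt
    [0.2, 0.3, 0.5, 0.0],   # g2 also credits g1
    [0.1, 0.1, 0.4, 0.4],   # g3 credits g1 and g2
])
assert np.allclose(A.sum(axis=1), 1.0)  # row stochasticity

# Path marginalization: total prompt credit for generation i is its
# direct prompt attribution plus the credit it routes through earlier
# generations, which have already been resolved onto the prompt.
total = np.zeros((A.shape[0], n_prompt))
for i in range(A.shape[0]):
    total[i] = A[i, :n_prompt] + A[i, n_prompt:n_prompt + i] @ total[:i]

print(total)
# Each row still sums to 1: inter-generational influence has been
# marginalized out, leaving a prompt-only context attribution.
print(total.sum(axis=1))
```

Under these toy values, g3's direct prompt attribution is only 0.2, but after marginalizing the paths through g1 and g2 its full prompt attribution becomes [0.6, 0.4], illustrating why discarding inter-generational influence understates prompt credit.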
Problem

Research questions and friction points this paper is trying to address.

Explaining opaque reasoning of large language models
Overcoming incomplete context attributions in autoregressive LLMs
Quantifying inter-generational influence in model generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

CAGE framework uses attribution graphs for LLM reasoning
Graph quantifies prompt and prior generation influences
Marginalizes contributions along paths to improve faithfulness