🤖 AI Summary
The contextual utilization mechanisms of language models remain opaque, making it difficult for users to distinguish whether model responses stem from parametric knowledge or the input context, and to identify the specific context segments responsible. Method: We propose the first gold-standard evaluation framework for highlight explanations (HEs), built on a benchmark dataset with human-annotated, ground-truth context attribution labels, thereby overcoming the indirectness of conventional proxy metrics. The framework enables systematic evaluation across multiple models, datasets, and realistic scenarios, including long-context settings and positional biases. Contribution/Results: We evaluate three established HE methods alongside MechLight, a mechanistic interpretability method adapted for context attribution. Results show MechLight achieves the best overall performance; however, all methods exhibit substantial limitations with long contexts and positional sensitivity, exposing fundamental bottlenecks that hinder practical deployment of current HE techniques.
📝 Abstract
Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or the provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution, as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold-standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.