AI Summary
Large language models often conflate knowledge sources, struggling to distinguish whether their outputs derive from user-provided context or from internal parametric memory, a limitation that frequently leads to factual errors. To address this, the work introduces the concept of "contributive attribution" and presents AttriWiki, a self-supervised data pipeline that generates high-quality attribution labels without manual annotation. The authors train lightweight linear probes on hidden representations to classify the source of model knowledge. Combining self-supervised learning, hidden-state analysis, and cross-domain transfer, the method achieves a Macro-F1 score of 0.96 on mainstream models such as Llama-3.1-8B, with cross-domain performance consistently ranging from 0.94 to 0.99. Experiments further reveal that attribution errors can increase downstream task error rates by up to 70%, underscoring the critical importance of accurate knowledge attribution.
Abstract
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing the user-provided context) and (ii) factuality violations (errors originating from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or on its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. To train it, we introduce AttriWiki, a self-supervised data pipeline that prompts models either to recall withheld entities from memory or to read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge-source confusion and unfaithful answers. Yet models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
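The core claim above is that a simple linear classifier over a model's hidden states suffices to predict the dominant knowledge source. The following is a minimal, self-contained sketch of that idea, not the paper's actual code: the hidden states and labels here are synthetic stand-ins (the real pipeline would extract activations from a fixed layer of the LLM on AttriWiki examples), and the separation direction is an assumption made purely for illustration.

```python
import numpy as np

# Hypothetical setup: each example is a d-dimensional hidden state from a
# fixed transformer layer, labelled with the dominant knowledge source:
#   0 = answer read from the provided context
#   1 = answer recalled from parametric memory
rng = np.random.default_rng(0)
d, n = 64, 400

# Synthetic stand-in for AttriWiki-style data: the two classes are offset
# along one random direction, mimicking a linearly decodable attribution
# signal in the hidden states.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction)

# The probe itself: a single linear classifier, fit here by least squares
# on +/-1 targets with a bias column (any linear method would do).
X_train = np.hstack([hidden[:300], np.ones((300, 1))])
w, *_ = np.linalg.lstsq(X_train, 2 * labels[:300] - 1, rcond=None)

# Classify held-out examples by the sign of the linear score.
X_test = np.hstack([hidden[300:], np.ones((100, 1))])
preds = (X_test @ w > 0).astype(int)
accuracy = (preds == labels[300:]).mean()
```

On real activations the interesting question is whether such a probe generalizes across domains, which is what the reported 0.94-0.99 out-of-domain Macro-F1 measures.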