AI Summary
Large language models often conflate knowledge sources, struggling to distinguish whether their outputs derive from user-provided context or from internal parametric memory, a limitation that frequently leads to factual errors. To address this, the work introduces the concept of "contributive attribution" and presents AttriWiki, a self-supervised data pipeline that generates high-quality attribution labels without manual annotation. The authors train lightweight linear probes on hidden representations to classify the source of model knowledge. Combining self-supervised learning, hidden-state analysis, and cross-domain transfer, the method achieves a Macro-F1 score of 0.96 on mainstream models such as Llama-3.1-8B, with cross-domain performance consistently ranging from 0.94 to 0.99. Experiments further reveal that attribution errors can increase downstream task error rates by up to 70%, underscoring the critical importance of accurate knowledge attribution.
Abstract
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing the user-provided context) and (ii) factuality violations (errors originating from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or on its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. To train it, we introduce AttriWiki, a self-supervised data pipeline that prompts models either to recall withheld entities from memory or to read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge-source confusion and unfaithful answers. Yet models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
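The core claim above is that a simple linear classifier over a model's hidden states suffices to predict the dominant knowledge source. The following is a minimal, self-contained sketch of that idea, not the paper's actual code: the hidden states and labels here are synthetic stand-ins (the real pipeline would extract activations from a fixed layer of the LLM on AttriWiki examples), and the separation direction is an assumption made purely for illustration.

```python
import numpy as np

# Hypothetical setup: each example is a d-dimensional hidden state from a
# fixed transformer layer, labelled with the dominant knowledge source:
#   0 = answer read from the provided context
#   1 = answer recalled from parametric memory
rng = np.random.default_rng(0)
d, n = 64, 400

# Synthetic stand-in for AttriWiki-style data: the two classes are offset
# along one random direction, mimicking a linearly decodable attribution
# signal in the hidden states.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction)

# The probe itself: a single linear classifier, fit here by least squares
# on +/-1 targets with a bias column (any linear method would do).
X_train = np.hstack([hidden[:300], np.ones((300, 1))])
w, *_ = np.linalg.lstsq(X_train, 2 * labels[:300] - 1, rcond=None)

# Classify held-out examples by the sign of the linear score.
X_test = np.hstack([hidden[300:], np.ones((100, 1))])
preds = (X_test @ w > 0).astype(int)
accuracy = (preds == labels[300:]).mean()
```

On real activations the interesting question is whether such a probe generalizes across domains, which is what the reported 0.94-0.99 out-of-domain Macro-F1 measures.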