Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

📅 2025-06-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of verifiable provenance in large language model (LLM) outputs by proposing Active Indexing, a retrieval-free, training-time indexing method. Instead of relying on external retrieval at inference, it performs continual pretraining on synthetically generated, multi-form question-answer pairs that jointly model bidirectional generation: “source document → factual statement” and “factual statement → source document.” This overcomes the generalization and paraphrase-robustness limitations of conventional passive citation tagging. The methodology combines continual pretraining with instruction tuning, synthetic data augmentation, and bidirectional generative modeling. The paper also introduces CitePretrainBench, the first open-source benchmark for citation-aware pretraining, covering Wikipedia, Common Crawl, arXiv, and novel documents. Evaluations on Qwen2.5-3B and Qwen2.5-7B show up to a 30.2% absolute improvement in citation precision, and performance continues to improve as synthetic data volume scales, with consistent gains even at a 16× token budget.
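The bidirectional “source document → factual statement” and “factual statement → source document” pairs described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the document IDs, facts, and prompt templates are invented for the example.

```python
# Sketch of Active Indexing training-pair construction (hypothetical format).
# For each (doc_id, fact) pair we emit both generation directions, so the
# model learns to generate content from a cited source AND to attribute
# its own answers back to that source.

def make_active_indexing_pairs(doc_id, facts):
    """Build bidirectional synthetic training examples for one document."""
    pairs = []
    for fact in facts:
        # source -> fact: given the identifier, generate the fact
        pairs.append({
            "prompt": f"According to document [{doc_id}], state one fact:",
            "completion": fact,
        })
        # fact -> source: given the fact, attribute it to its source
        pairs.append({
            "prompt": f"Which document supports this statement? {fact}",
            "completion": f"[{doc_id}]",
        })
    return pairs

examples = make_active_indexing_pairs(
    "wiki_001",
    ["The Eiffel Tower was completed in 1889."],
)
for ex in examples:
    print(ex["prompt"], "->", ex["completion"])
```

In the paper's framing, restating each fact in diverse compositional forms (paraphrases, multi-hop combinations) would replace the single template used here.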

📝 Abstract
Trustworthy language models should provide both correct and verifiable answers. While language models can sometimes attribute their outputs to pretraining data, their citations are often unreliable due to hallucination. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during (continual) pretraining--without test-time retrieval--by revising the training process. To evaluate this, we release CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel, unseen documents and probes both short-form (single fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to bind facts to persistent document identifiers, and (2) instruction tuning to elicit citation behavior. We find that simple Passive Indexing, which appends an identifier to each document, helps memorize verbatim text but fails on paraphrased or compositional facts. Instead, we propose Active Indexing, which continually pretrains on synthetic QA pairs that (1) restate each fact in diverse compositional forms, and (2) require bidirectional source-to-fact and fact-to-source generation, jointly teaching the model to generate content from a cited source and to attribute its own answers. Experiments with Qwen2.5-7B and 3B show that Active Indexing consistently outperforms Passive Indexing across all tasks and models, with citation precision gains up to 30.2 percent. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16 times the original token count.
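The Passive Indexing baseline from the abstract, which "appends an identifier to each document" during continual pretraining, might look like the sketch below. The tag syntax and identifier format are assumptions for illustration only; the paper does not specify them here.

```python
# Sketch of Passive Indexing: attach a persistent document identifier to each
# pretraining document, so verbatim spans become weakly associated with their
# source. The "<doc id=...>" wrapper is a hypothetical choice, not the paper's.

def passively_index(doc_id, text):
    """Wrap a document with its identifier for continual pretraining."""
    return f"<doc id={doc_id}>\n{text}\n</doc id={doc_id}>"

print(passively_index("arxiv_2406.00001", "Transformers use self-attention."))
```

As the abstract notes, this helps the model memorize verbatim text next to its identifier but fails on paraphrased or compositional facts, which is what motivates the Active Indexing pairs.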
Problem

Research questions and friction points this paper addresses.

Enabling LLMs to reliably attribute answers without retrieval
Reducing latency and infrastructure dependence in citation systems
Improving citation precision for paraphrased and compositional facts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Indexing with synthetic QA pairs
Continual pretraining for document binding
Instruction tuning for citation behavior