OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

📅 2025-04-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This paper addresses the challenge of tracing large language model (LLM) outputs back to their training data. The authors propose the first system enabling full-scale, real-time, fine-grained output provenance. Methodologically, it leverages an extended infini-gram index integrated with efficient approximate string matching and a memory-optimized retrieval architecture, aligning output spans against multi-trillion-token training corpora at the character level. The contributions are threefold: (1) the first end-to-end, open-source, and reproducible system for real-time provenance across the entire training dataset; (2) substantially enhanced interpretability for factual verification, hallucination attribution, and creativity analysis; and (3) empirical validation on open models including OLMo, with query latency of a few seconds. The system bridges a critical gap in LLM transparency, enabling precise, scalable, and operationally feasible attribution without compromising performance or accessibility.

📝 Abstract
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
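To make the core idea concrete, here is a toy sketch of verbatim-span matching between a model output and a corpus. This is not the actual OLMoTrace implementation: the real system uses an extended infini-gram index (suffix arrays over multi-trillion-token corpora) to do this in seconds, while the function below (a hypothetical name) brute-forces over a tiny word-level corpus purely for illustration.

```python
# Toy illustration of verbatim-span matching, in the spirit of OLMoTrace.
# OLMoTrace itself uses an extended infini-gram index over trillion-token
# corpora; this brute-force word-level version only shows the concept.

def find_verbatim_matches(output: str, corpus: list[str], min_len: int = 4):
    """Return maximal word spans of `output` (at least `min_len` words)
    that appear verbatim in some corpus document, as (start, end, text)."""
    words = output.split()
    matches = []
    i = 0
    while i < len(words):
        best_end = None
        # Greedily extend the span starting at word i while it still
        # occurs verbatim in at least one document.
        for j in range(i + min_len, len(words) + 1):
            span = " ".join(words[i:j])
            # Plain substring containment; a real system would match on
            # token boundaries rather than raw characters.
            if any(span in doc for doc in corpus):
                best_end = j
            else:
                break
        if best_end is not None:
            matches.append((i, best_end, " ".join(words[i:best_end])))
            i = best_end  # skip past the matched span
        else:
            i += 1
    return matches
```

For example, with a corpus containing "the quick brown fox jumps over the lazy dog", the output "he said the quick brown fox jumps high" yields the single maximal match "the quick brown fox jumps". OLMoTrace replaces the linear scan with index lookups, which is what makes tracing against the full training data feasible in real time.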
Problem

Research questions and friction points this paper is trying to address.

How can LM outputs be traced back to their training data sources?
How can verbatim matches be found efficiently in massive training corpora?
How can such matches help analyze factuality and model behavior?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time tracing of LM outputs to training data
Uses extended infini-gram for fast results
Open-source system for analyzing model behavior
Jiacheng Liu
Allen Institute for AI, University of Washington
Taylor Blanton
Allen Institute for AI
Yanai Elazar
Assistant Professor at Bar-Ilan University
NLP, ML
Sewon Min
UC Berkeley EECS & Allen Institute for AI
Natural Language Processing, Machine Learning
YenSung Chen
Allen Institute for AI
Arnavi Chheda-Kothary
Allen Institute for AI, University of Washington
Huy Tran
Allen Institute for AI
Byron Bischoff
Allen Institute for AI
Eric Stuart Marsh
Allen Institute for AI
Michael Schmitz
Allen Institute for AI
Cassidy Trier
Allen Institute for AI
Aaron Sarnat
Allen Institute for AI
Jenna James
Allen Institute for AI
Jon Borchardt
Allen Institute for AI
Bailey Kuehl
Allen Institute for AI
Evie Cheng
Allen Institute for AI
Karen Farley
Allen Institute for AI
S. Sreeram
Allen Institute for AI
Taira Anderson
Allen Institute for AI
David Albright
Allen Institute for AI
Carissa Schoenick
Allen Institute for AI
Luca Soldaini
Allen Institute for AI
Large Language Models, Open Source AI, Information Retrieval
Dirk Groeneveld
Allen Institute for Artificial Intelligence
natural language processing, neural networks, deep learning
Rock Yuren Pang
University of Washington
Human Computer Interaction, Human-AI Interaction, Accessibility, Responsible AI
Pang Wei Koh
University of Washington; Allen Institute for AI
Machine learning, Natural language processing, Computational biology
Noah A. Smith
University of Washington; Allen Institute for Artificial Intelligence
natural language processing, machine learning, computational social science, computer music
Sophie Lebrecht
Allen Institute for AI
Yejin Choi
Stanford University / NVIDIA
Natural Language Processing, Deep Learning, Artificial Intelligence, Commonsense Reasoning
Hanna Hajishirzi
Allen Institute for AI, University of Washington
Ali Farhadi
Allen Institute for AI, University of Washington
Jesse Dodge
Allen Institute for AI
NLP, Machine Learning