🤖 AI Summary
Large language models (LLMs) struggle to reliably acquire and recall specific factual knowledge, primarily because parametric memory depends on how often facts appear in the training data and lacks controllable learning mechanisms. To address this, the authors propose Active Reading: a framework in which LLMs generate their own learning strategies for studying a given body of material and systematically internalize domain-specific knowledge, even at pretraining scale. The method combines self-generated study strategies, large-scale synthetic text generation, and finetuning on the resulting data, targeting deep comprehension of expert documents. Experiments show substantial improvements: +313% relative accuracy on a Wikipedia-grounded subset of SimpleQA (reaching 66%) and +160% relative on FinanceBench (reaching 26%). The released Meta WikiExpert-8B model, trained on 1 trillion generated tokens, outperforms models with hundreds of billions of parameters on factual QA. The core contribution is a controllable, reproducible paradigm for systematic factual knowledge acquisition in LLMs.
📝 Abstract
LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is unreliable, depending largely on the prevalence of particular facts in the training data and on other, poorly understood factors. Practitioners lack tools to ensure that models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework in which we train models to study a given set of material with self-generated learning strategies. First, we demonstrate that models trained with Active Reading on expert domains absorb significantly more knowledge than with vanilla finetuning and other data augmentations. By applying Active Reading to the source documents for each benchmark, we train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning). Finally, we show that Active Reading can be applied at pretraining scale to build more factual models. As a demonstration, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outperforms models with hundreds of billions of parameters on factual QA.
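The data-generation loop implied by the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `propose_strategies` and `apply_strategy` are hypothetical stand-ins for LLM calls (the paper's actual prompts, strategy set, and pipeline are not shown in the abstract), and the example strategy names are assumptions.

```python
# Hedged sketch of an Active Reading-style augmentation loop: the model first
# self-generates learning strategies for a source document, then applies each
# strategy to produce synthetic study text; the expanded corpus is then used
# for finetuning or pretraining. Both helpers below are hypothetical stand-ins
# for LLM calls.

def propose_strategies(document: str) -> list[str]:
    # Hypothetical: a real system would have the LLM self-generate strategies
    # (e.g. paraphrasing, self-quizzing, summarizing) tailored to the document.
    return ["paraphrase", "self-quiz", "summarize"]


def apply_strategy(strategy: str, document: str) -> str:
    # Hypothetical: a real system would have the LLM rewrite the document
    # according to the chosen strategy; here we just tag the text.
    return f"[{strategy}] {document}"


def active_reading(documents: list[str]) -> list[str]:
    """Expand source documents into a synthetic study corpus."""
    corpus = []
    for doc in documents:
        for strategy in propose_strategies(doc):
            corpus.append(apply_strategy(strategy, doc))
    return corpus  # the model is then trained on this expanded corpus


examples = active_reading(["The Eiffel Tower was completed in 1889."])
```

Each source document is multiplied by the number of self-generated strategies, which is how a modest document collection can expand into a trillion-token training corpus.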