LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language models (LMs) rely heavily on world knowledge, yet the mechanisms underlying knowledge acquisition and representation during pretraining remain poorly understood. Method: We introduce a systematic framework for studying knowledge learning during pretraining: (i) a Wikipedia-based pretraining corpus with fully annotated entity mentions; (ii) an efficient entity-based retrieval method over pretraining data that improves retrieval performance by up to 80.4%; and (iii) a set of pretrained models at multiple scales with dense intermediate training checkpoints. Contribution/Results: This toolkit enables fine-grained attribution analysis, modeling of learning dynamics, and controlled interventions in knowledge formation. Empirical analysis reveals that fact frequency is a primary driver of knowledge acquisition but does not fully explain learning trends. On knowledge-intensive benchmarks, the released models match the performance of popular open-source LMs. The work provides an empirically grounded environment for developing LMs with more consistent, robust, and interpretable knowledge representations.

📝 Abstract
Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with performance comparable to popular open-source models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.
Problem

Research questions and friction points this paper is trying to address.

Analyzing how LMs acquire knowledge from pretraining data
Understanding internal processes turning data into knowledge representations
Studying connections between entity mentions and downstream performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-rich pretraining corpus with entity annotations
Entity-based retrieval method over pretraining data, outperforming prior approaches by up to 80.4%
Pretrained models with intermediate checkpoints for analysis
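The core idea behind entity-based retrieval over an annotated corpus can be illustrated with a minimal inverted index mapping entity IDs to the documents that mention them. This is a hedged sketch, not the paper's actual implementation: the mini-corpus, the Wikidata-style QIDs, and the function names `build_entity_index` and `retrieve` are all illustrative assumptions.

```python
from collections import defaultdict

# Illustrative mini-corpus: each document carries entity annotations,
# loosely mirroring a corpus fully annotated with entity mentions.
# Entity IDs and texts here are made up for demonstration.
corpus = [
    {"doc_id": 0, "text": "Marie Curie won the Nobel Prize.", "entities": ["Q7186", "Q7191"]},
    {"doc_id": 1, "text": "The Nobel Prize is awarded in Stockholm.", "entities": ["Q7191", "Q1754"]},
    {"doc_id": 2, "text": "Marie Curie was born in Warsaw.", "entities": ["Q7186", "Q270"]},
]

def build_entity_index(docs):
    """Map each entity ID to the set of doc_ids mentioning it."""
    index = defaultdict(set)
    for doc in docs:
        for ent in doc["entities"]:
            index[ent].add(doc["doc_id"])
    return index

def retrieve(index, entities):
    """Return doc_ids of documents mentioning ALL the given entities."""
    sets = [index.get(e, set()) for e in entities]
    return sorted(set.intersection(*sets)) if sets else []

index = build_entity_index(corpus)
print(retrieve(index, ["Q7186"]))           # → [0, 2]
print(retrieve(index, ["Q7186", "Q7191"]))  # → [0]
```

Indexing on entity IDs rather than surface strings is what makes this kind of lookup robust to aliases ("Marie Curie" vs. "Curie") and is one plausible reason an entity-based method can outperform string-matching retrieval over pretraining data.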