ScienceMeter: Tracking Scientific Knowledge Updates in Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the problem of scientific knowledge obsolescence in large language models (LLMs). To systematically evaluate LLMs' capacity for knowledge updating, we propose a three-dimensional framework: knowledge preservation (past), knowledge acquisition (present), and knowledge projection (future). We introduce ScienceMeter, a comprehensive evaluation suite for scientific knowledge updating, and release a benchmark dataset of over 30,000 scientific claims spanning 10 years and 10 disciplines. We design dual evaluation tasks, claim verification and claim generation, and integrate five representative update approaches, including fine-tuning and inference-time enhancement methods. Experimental results show that even the best-performing method achieves only 85.9%, 71.7%, and 37.7% accuracy on preservation, acquisition, and projection, respectively. We further find that inference-based methods suit larger models, while training-based methods are more effective for smaller ones; moreover, performance across all three dimensions is strongly correlated.

📝 Abstract
Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains, including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches, including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.
Problem

Research questions and friction points this paper is trying to address.

Evaluating outdated scientific knowledge in Large Language Models
Assessing knowledge preservation, acquisition, and projection in LLMs
Analyzing cross-domain performance of knowledge update methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

ScienceMeter framework evaluates knowledge updates
Metrics: preservation, acquisition, projection
Tests five update methods on 15,444 papers
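The three ScienceMeter dimensions can each be read as an accuracy over a different split of claims: preservation over claims from papers the model saw before the update, acquisition over claims from newly introduced papers, and projection over claims expected to emerge later. A minimal sketch of that scoring, with hypothetical helper names (the paper's actual evaluation pipeline and data format are not reproduced here):

```python
# Hypothetical sketch of ScienceMeter-style scoring. Each dimension is the
# fraction of claim-verification verdicts that match the gold labels:
#   preservation -> claims from papers learned before the update
#   acquisition  -> claims from newly introduced papers
#   projection   -> related claims expected to emerge in the future

def accuracy(predictions, labels):
    """Fraction of predicted claim verdicts matching the gold verdicts."""
    assert predictions and len(predictions) == len(labels)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

def science_meter_scores(preds_by_split, gold_by_split):
    """Return the three dimensions as accuracies over their claim splits."""
    return {
        dim: accuracy(preds_by_split[dim], gold_by_split[dim])
        for dim in ("preservation", "acquisition", "projection")
    }

# Toy usage with boolean claim-verification verdicts (illustrative only):
preds = {
    "preservation": [True, True, False, True],
    "acquisition":  [True, False, True, False],
    "projection":   [False, False, True, False],
}
gold = {dim: [True, True, True, True] for dim in preds}
print(science_meter_scores(preds, gold))
# -> {'preservation': 0.75, 'acquisition': 0.5, 'projection': 0.25}
```

The point of keeping the three splits separate is that an update method can trade one dimension against another, e.g. acquiring new claims while overwriting old ones, which a single aggregate accuracy would hide.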