🤖 AI Summary
Traditional citation-based metrics suffer from temporal lag, disciplinary bias, and the Matthew effect, which limits their ability to provide timely and equitable assessments of scholarly impact. This work proposes LLM-Metrics, an evaluation paradigm that uses the parametric memory of large language models (LLMs) to gauge academic influence. Recall strength is quantified through four multiple-choice probing tasks (title, author, method, and venue recognition) that test a model's implicit knowledge of recent publications. Evaluated on 549 computer science papers from 2023–2024 across 17 LLMs (0.5B–72B parameters), the approach shows a small but statistically significant correlation with citation counts (Spearman ρ = 0.1495, p = 0.0004), with a stronger correlation for 2024 papers (ρ = 0.1880) and author recognition emerging as the most discriminative probe. Notably, the compact Llama-3.2-3B-Instruct (ρ = 0.1829) outperforms most larger models, suggesting that the limited capacity of smaller models may act as an effective filter for scholarly information.
📝 Abstract
Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and the Matthew effect. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title, author, method, and venue recognition, and evaluated 549 computer science papers published in 2023–2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive correlations with citation counts, 9 of them significant at p < 0.05; the overall Spearman correlation was ρ = 0.1495 (p = 0.0004). Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers (ρ = 0.1880), whose citation counts were near zero at model-training time, which makes a simple reverse-causality explanation less plausible. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, the relationship between model scale and predictive power was non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct (ρ = 0.1829), outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models serves as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
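The evaluation described above can be illustrated with a minimal sketch: score each paper by how many of the four probes a model answers correctly, then rank-correlate those scores with citation counts. The abstract does not say how probe outcomes are aggregated, so the equal-weight average in `recall_score` is an assumption, and the function names and toy data are hypothetical rather than taken from the paper.

```python
from statistics import mean

# The four probe types named in the abstract.
PROBE_TYPES = ("title", "author", "method", "venue")


def recall_score(probe_results):
    """Fraction of the four probes answered correctly for one paper.

    Assumption: equal weighting of probe types; the paper may aggregate
    differently. `probe_results` maps probe type -> bool.
    """
    return mean(1.0 if probe_results[p] else 0.0 for p in PROBE_TYPES)


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: probe outcomes and citation counts for four papers.
toy_probes = [
    {"title": True, "author": True, "method": True, "venue": False},
    {"title": True, "author": False, "method": False, "venue": False},
    {"title": False, "author": False, "method": False, "venue": False},
    {"title": True, "author": True, "method": False, "venue": False},
]
toy_citations = [120, 8, 3, 40]

toy_scores = [recall_score(p) for p in toy_probes]
toy_rho = spearman_rho(toy_scores, toy_citations)
```

Spearman correlation uses only ranks, which suits heavily skewed citation counts; with real data, a library routine such as `scipy.stats.spearmanr` would also report the p-value.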