Evaluating the Unseen Capabilities: How Many Theorems Do LLMs Know?

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluation often misjudges true capability by overlooking "unseen knowledge" — latent knowledge encoded in the model but absent from test responses. This paper introduces KnowSum, a statistical framework that applies capture-recapture methodology to LLM assessment, extrapolating the volume of unobserved knowledge from the appearance-frequency distribution of observed knowledge instances. Its core contributions are threefold: (1) systematic identification of the "unseen knowledge" blind spot in evaluation; (2) a statistically grounded method for quantifying and extrapolating the scale of unobserved knowledge; and (3) reconstruction of LLM capability rankings based on total latent knowledge, improving evaluation completeness. Experiments across three applications — estimating theorem-level knowledge, measuring retrieval effectiveness, and quantifying generation diversity — show that a substantial volume of knowledge is missed when relying solely on observed performance, and that KnowSum yields significantly different comparative rankings for several common LLMs.

📝 Abstract
Accurate evaluation of large language models (LLMs) is crucial for understanding their capabilities and guiding their development. However, current evaluations often inconsistently reflect the actual capacities of these models. In this paper, we demonstrate that one of many contributing factors to this "evaluation crisis" is the oversight of unseen knowledge -- information encoded by LLMs but not directly observed or not yet observed during evaluations. We introduce KnowSum, a statistical framework designed to provide a more comprehensive assessment by quantifying the unseen knowledge for a class of evaluation tasks. KnowSum estimates the unobserved portion by extrapolating from the appearance frequencies of observed knowledge instances. We demonstrate the effectiveness and utility of KnowSum across three critical applications: estimating total knowledge, evaluating information retrieval effectiveness, and measuring output diversity. Our experiments reveal that a substantial volume of knowledge is omitted when relying solely on observed LLM performance. Importantly, KnowSum yields significantly different comparative rankings for several common LLMs based on their internal knowledge.
Problem

Research questions and friction points this paper is trying to address.

Assessing unseen knowledge in LLMs for accurate evaluation
Addressing evaluation crisis by quantifying unobserved capabilities
Improving LLM rankings via comprehensive knowledge estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces KnowSum for unseen knowledge estimation
Uses frequency extrapolation to assess unobserved data
Evaluates LLMs on total knowledge and diversity
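The frequency-extrapolation idea above can be sketched with a standard capture-recapture estimator. The Chao1 species-richness estimator is a common choice for this kind of problem; note this is an illustrative sketch, not the paper's exact KnowSum estimator, and the `chao1_estimate` function and its sample data are hypothetical.

```python
from collections import Counter

def chao1_estimate(observations):
    """Estimate total distinct items (seen + unseen) from raw observations.

    observations: a list of item labels, e.g. theorem names an LLM produced
    across repeated queries. Items seen exactly once (f1) or twice (f2)
    drive the correction term for the unseen count.
    """
    freq = Counter(observations)        # item -> times observed
    counts = Counter(freq.values())     # frequency of frequencies
    s_obs = len(freq)                   # distinct items actually observed
    f1, f2 = counts.get(1, 0), counts.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # bias-corrected fallback when no items were seen exactly twice
    return s_obs + f1 * (f1 - 1) / 2

# 3 singletons and 1 doubleton among 4 distinct "theorems":
sample = ["t1", "t2", "t3", "t4", "t4"]
print(chao1_estimate(sample))  # 4 + 3*3/(2*1) = 8.5
```

Intuitively, many singletons relative to doubletons signal a long tail of knowledge the sampling has not yet surfaced, so the estimated total rises well above the observed count.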