Beyond Questions: Evaluating What Large Language Models (Actually) Know

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Knowledge evaluation of large language models (LLMs) currently relies mostly on predefined questions, which are susceptible to availability bias and fail to reflect the full extent of what models know. This work proposes an open-ended knowledge assessment paradigm that elicits knowledge through open prompts, such as “Tell me everything you know about a given entity”, and automatically verifies the accuracy of the model's statements against reference corpora. The authors introduce BeQu, a benchmark dataset comprising 10,000 entities, and combine prompt engineering, statement extraction, and fact-checking to systematically evaluate how mainstream LLMs express knowledge. The experiments examine how model scale, inference budget, prompt format, and knowledge domain influence knowledge expression. The dataset and associated leaderboard are publicly released to support more authentic and holistic evaluations of model knowledge.
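The three stages named above (open prompt, statement extraction, verification against a reference corpus) can be illustrated with a minimal sketch. This is not the BeQu implementation: the function names, the naive sentence splitter, and the token-overlap check with a 0.8 threshold are all assumptions chosen for brevity, and a real pipeline would likely use a model-based extractor and fact-checker instead.

```python
import re
from dataclasses import dataclass


def build_prompt(entity: str) -> str:
    # Open-ended elicitation prompt, following the example wording in the summary.
    return f"Tell me everything you know about {entity}."


def extract_statements(response: str) -> list[str]:
    # Hypothetical stand-in for statement extraction: split on sentence-final
    # punctuation. Abbreviations such as "M.L." would be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p.strip() for p in parts if p.strip()]


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, used only for the toy overlap check.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(statement: str, corpus: list[str], threshold: float = 0.8) -> bool:
    # Hypothetical stand-in for fact-checking: a statement counts as supported
    # if at least `threshold` of its tokens appear in a single reference document.
    stmt = _tokens(statement)
    if not stmt:
        return False
    return any(len(stmt & _tokens(doc)) / len(stmt) >= threshold for doc in corpus)


@dataclass
class Score:
    total: int      # number of statements extracted from the response
    supported: int  # number of statements supported by the reference corpus

    @property
    def precision(self) -> float:
        return self.supported / self.total if self.total else 0.0


def score_response(response: str, corpus: list[str], threshold: float = 0.8) -> Score:
    statements = extract_statements(response)
    supported = sum(is_supported(s, corpus, threshold) for s in statements)
    return Score(total=len(statements), supported=supported)


if __name__ == "__main__":
    corpus = [
        "Martin Luther King Jr. was born in Atlanta, Georgia, in 1929.",
        "He won the Nobel Peace Prize in 1964.",
    ]
    response = (
        "Martin Luther King was born in Atlanta. "
        "He won the Nobel Peace Prize in 1964. "
        "He was born on Mars."
    )
    print(build_prompt("Martin Luther King"))
    print(score_response(response, corpus))
```

Reporting both the number of supported statements and the precision separates how much a model surfaces from how accurate that output is; whether BeQu reports these exact quantities is not stated in the summary.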

📝 Abstract

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?") and therefore evaluate only the knowledge that benchmark designers explicitly choose to query, which introduces a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

Problem

Research questions and friction points this paper is trying to address.

knowledge evaluation

large language models

availability bias

parametric knowledge

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

open knowledge evaluation

parametric knowledge

large language models