Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) frequently omit critical information or cover key perspectives inadequately, especially on sensitive topics, which poses reliability and safety risks. Method: the paper introduces three automated approaches for evaluating factual completeness: (1) sentence decomposition into atomic statements checked with natural language inference (NLI), (2) question-answer pair extraction and alignment across sources, and (3) end-to-end missing-information detection using large models. Contribution/Results: experiments show that the simple end-to-end method clearly outperforms the more complex pipeline strategies at identifying omissions, though at the cost of reduced robustness, interpretability, and result granularity. The authors also benchmark several popular open-weight LLMs on multi-source query tasks, providing empirical evidence of a systematic tendency toward information omission, and establish an evaluation benchmark with an extensible framework for assessing informational coverage and factual completeness in LLM-generated text.
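A minimal sketch of what the end-to-end strategy might look like in practice; the function name and prompt wording are illustrative assumptions, not the paper's actual template:

```python
def build_omission_prompt(sources: list[str], answer: str) -> str:
    """Build a single prompt asking an LLM to list facts that appear in the
    source documents but are missing from the generated answer.
    Illustrative wording only -- not the paper's exact prompt."""
    src_block = "\n\n".join(
        f"Source {i + 1}:\n{text}" for i, text in enumerate(sources)
    )
    return (
        "You are given source documents and a generated answer.\n\n"
        f"{src_block}\n\n"
        f"Answer:\n{answer}\n\n"
        "List every important fact from the sources that is missing "
        "from the answer, one per line. If nothing is missing, reply NONE."
    )

prompt = build_omission_prompt(
    ["The drug lowers fever.", "The drug can cause drowsiness."],
    "Studies show the drug lowers fever effectively.",
)
```

The resulting string would be sent to an LLM, whose free-form reply is taken directly as the list of omissions; this directness is what the abstract trades against robustness and result granularity.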

📝 Abstract
Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
Problem

Research questions and friction points this paper is trying to address.

Evaluating comprehensiveness of LLM-generated texts
Detecting missing information in generated content
Comparing automated methods for factual recall assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

NLI-based method decomposes texts into atomic statements
Q&A-based approach compares responses across different sources
End-to-end method directly identifies missing content using LLMs