Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in the factuality evaluation of long-form output from large language models: an overemphasis on precision at the expense of recall, i.e., whether the relevant facts are actually covered. To remedy this, the authors propose an evaluation framework that jointly measures precision and recall, introducing an importance-aware recall mechanism. The approach extracts reference facts from external knowledge sources such as Wikipedia, decomposes them into atomic claim units, and weights each claim by its relevance and salience in order to assess the factual completeness of generated text. Experiments show that while current models achieve high precision, their recall of essential facts is substantially lower, underscoring the necessity and effectiveness of the proposed framework.
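The summary above describes claims that are weighted by importance and a recall score that credits a response only for the weighted reference facts it covers. The paper does not publish its scoring code here, so the following is a minimal illustrative sketch under that reading; the function names, the `(claim, weight)` representation, and the coverage set are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch: importance-aware recall and claim-level precision
# for factuality evaluation, as described in the summary. Reference facts
# are atomic claims (e.g., extracted from Wikipedia) paired with
# relevance/salience weights.

def importance_weighted_recall(reference_facts, covered):
    """reference_facts: list of (claim, weight) pairs.
    covered: set of reference claims judged to be entailed by the response.
    Returns the weight-covered fraction of reference facts."""
    total = sum(w for _, w in reference_facts)
    if total == 0:
        return 0.0
    hit = sum(w for claim, w in reference_facts if claim in covered)
    return hit / total

def claim_precision(response_claims, verified):
    """response_claims: atomic claims decomposed from the response.
    verified: set of those claims confirmed against the knowledge source.
    Returns the fraction of generated claims that are factual."""
    if not response_claims:
        return 0.0
    return len(verified & set(response_claims)) / len(response_claims)
```

A response that states only the most salient facts would score high on this recall metric despite low plain-count coverage, which matches the paper's observation that models cover highly important facts better than the full set of relevant ones.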
📝 Abstract
Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, i.e., whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in the generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation, and that models are generally better at covering highly important facts than the full set of relevant facts.
Problem

Research questions and friction points this paper is trying to address.

factuality evaluation
recall
long-form generation
large language models
factual completeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

factuality evaluation
importance-aware recall
long-form generation
large language models
knowledge-based assessment