🤖 AI Summary
Existing recall metrics have notable limitations in long-text scenarios: lexical-overlap methods frequently mishandle unsubstantiated entities and paraphrased answers, while holistic LLM-based assessment lacks structured verification and remains prone to misalignment and hallucination. This paper proposes LongRecall, a three-stage recall evaluation framework: first, the generated text is decomposed into atomic facts; second, plausible candidate matches are hierarchically narrowed through combined lexical matching and semantic filtering; third, fine-grained verification is performed via structured-prompt entailment checks. LongRecall is the first framework to integrate fact decomposition and structured logical verification into recall evaluation, substantially reducing false positives and false negatives while remaining robust to linguistic variation and contextual shifts. Evaluated on three long-form QA benchmarks, LongRecall consistently outperforms strong baselines, and validation by both human annotators and LLM-based judges confirms significant gains in recall accuracy and reliability.
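The three-stage pipeline can be made concrete with a small sketch. This is a minimal, runnable illustration under simplifying assumptions, not the paper's implementation: the fact decomposer, semantic filter, and entailment judge below are toy token-level proxies standing in for what the paper does with LLM prompts and semantic similarity, and all function names are ours.

```python
def decompose_into_facts(answer: str) -> list[str]:
    """Stage 1: break the generated answer into self-contained atomic facts.
    (Toy proxy: sentence splitting; the paper uses an LLM decomposer.)"""
    return [s.strip() for s in answer.replace("\n", " ").split(".") if s.strip()]

def lexical_filter(gold_item: str, facts: list[str], k: int = 3) -> list[str]:
    """Stage 2a: keep the k facts with the highest token overlap with the gold item."""
    gold_tokens = set(gold_item.lower().split())
    return sorted(
        facts,
        key=lambda f: len(gold_tokens & set(f.lower().split())),
        reverse=True,
    )[:k]

def semantic_filter(gold_item: str, candidates: list[str]) -> list[str]:
    """Stage 2b: further narrow candidates by meaning.
    (Toy proxy: pass-through; a real system would use semantic similarity.)"""
    return candidates

def entails(fact: str, gold_item: str) -> bool:
    """Stage 3: structured entailment check on the surviving candidates.
    (Toy proxy: token containment; the paper prompts an LLM entailment judge.)"""
    return set(gold_item.lower().split()) <= set(fact.lower().split())

def long_recall(answer: str, gold_items: list[str]) -> float:
    """A gold item counts as recalled if any surviving candidate fact entails it."""
    facts = decompose_into_facts(answer)
    matched = sum(
        any(entails(f, g) for f in semantic_filter(g, lexical_filter(g, facts)))
        for g in gold_items
    )
    return matched / len(gold_items) if gold_items else 1.0

print(long_recall(
    "Paris is the capital of France. Berlin is the capital of Germany.",
    ["capital of France Paris", "capital of Germany Berlin", "capital of Italy Rome"],
))  # -> 0.666..., since two of the three gold items are covered
```

The key design point the sketch preserves is that matching is per gold item, not holistic: each gold item is checked against a small, filtered set of atomic facts, which is what lets the framework reduce false positives (unsupported matches) and false negatives (missed paraphrases) at the same time.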
📝 Abstract
The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.