🤖 AI Summary
Existing long-text factuality assessment methods rely on claim decomposition and evidence retrieval but suffer from pipeline complexity, low efficiency, inaccurate claim extraction, and fragmented evidence. This paper proposes FastFact, a framework that couples chunk-level claim extraction with document-level evidence verification. It introduces a confidence-based pre-verification mechanism to reduce futile retrievals and combines selective web crawling with cross-paragraph evidence aggregation, significantly improving evidential sufficiency and alignment with human judgments. Evaluated on a human-annotated benchmark, FastFact substantially outperforms state-of-the-art methods in both assessment accuracy (human alignment) and reasoning/search efficiency, jointly optimizing accuracy and efficiency in long-text factuality assessment.
📝 Abstract
Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collected as one-line snippets.
To address these limitations, we propose FastFact, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and the best efficiency among existing baselines. FastFact first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calls while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence-insufficiency problem of previous pipelines.
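The pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: all function names, the confidence threshold, and the stub extraction/retrieval logic are assumptions, with real LLM and web-crawling calls replaced by stand-ins.

```python
# Hypothetical sketch of a FastFact-style pipeline. All names and the
# threshold are illustrative assumptions; LLM and crawler calls are stubbed.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for skipping retrieval


@dataclass
class Claim:
    text: str
    confidence: float  # model's self-reported confidence (assumption)


def chunk_text(text: str, size: int = 200) -> list[str]:
    """Split a long output into chunks (stand-in for chunk-level segmentation)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def extract_claims(chunk: str) -> list[Claim]:
    """Stub: in the real system an LLM extracts claims with confidence scores."""
    return [Claim(text=s.strip(), confidence=0.95 if "sky" in s else 0.4)
            for s in chunk.split(".") if s.strip()]


def retrieve_documents(claim: Claim) -> list[str]:
    """Stub: crawl full webpages (document-level evidence, not one-line snippets)."""
    return [f"full document text relevant to: {claim.text}"]


def verify(claim: Claim, evidence: list[str]) -> bool:
    """Stub: judge the claim against the aggregated document-level evidence."""
    return bool(evidence)


def assess(text: str) -> dict:
    """Chunk, extract, pre-verify by confidence, and search only when needed."""
    supported, searches = 0, 0
    claims = [c for ch in chunk_text(text) for c in extract_claims(ch)]
    for claim in claims:
        if claim.confidence >= CONFIDENCE_THRESHOLD:
            supported += 1  # pre-verified: skip costly web retrieval
        else:
            searches += 1
            if verify(claim, retrieve_documents(claim)):
                supported += 1
    return {"claims": len(claims), "supported": supported, "searches": searches}


result = assess("The sky is blue. Napoleon was born in 1821.")
```

The key efficiency idea, confidence-based pre-verification, shows up in `assess`: high-confidence claims bypass retrieval entirely, so only the uncertain claim triggers a (stubbed) web search here.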
Extensive experiments on an aggregated, manually annotated benchmark demonstrate the reliability of FastFact in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data are available at https://github.com/Yingjia-Wan/FastFact.