🤖 AI Summary
To address incomplete factual coverage in long-form text generation and the lack of interpretability in current evaluation methods, this paper proposes ICAT: a framework that decomposes generated text into atomic claims and jointly applies knowledge retrieval for factual verification and multi-strategy semantic alignment (rule-based matching, embedding similarity, and LLM-based judgment) to systematically model factual diversity and completeness. ICAT is the first framework to enable fine-grained, modular, and interpretable factual evaluation under unsupervised or weakly supervised settings, and it is adapted specifically to the TREC Web Track and ClueWeb benchmarks. Experiments demonstrate strong correlation between ICAT scores and human judgments. Moreover, ICAT supports cross-model, cross-task, and cross-domain quality diagnostics across multiple state-of-the-art large language models, significantly improving the reliability and practical utility of factual evaluation for long-form text generation.
📝 Abstract
This paper presents ICAT, an evaluation framework for measuring coverage of diverse factual information in long-form text generation. ICAT breaks a long output text down into a list of atomic claims and not only verifies each claim through retrieval from a (reliable) knowledge source, but also computes the alignment between the atomic factual claims and the various aspects expected to be present in the output. We study three implementations of the ICAT framework, each with a different assumption on the availability of aspects and a different alignment method. We evaluate the ICAT framework using data from the diversification task of the TREC Web Track and the ClueWeb corpus. We demonstrate strong correlation with human judgments and provide a comprehensive evaluation across multiple state-of-the-art LLMs. Our framework further offers interpretable and fine-grained analysis of diversity and coverage. Its modular design allows for easy adaptation to different domains and datasets, making it a valuable tool for evaluating the qualitative aspects of long-form responses produced by LLMs.
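The pipeline described above (atomic claims verified against a knowledge source, then aligned to expected aspects) can be sketched as a toy scorer. This is only an illustrative sketch, not the paper's exact formulation: the `Claim` structure, its field names, and the F1-style combination of factuality and coverage are assumptions made here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One atomic claim extracted from a long-form response (illustrative)."""
    text: str
    supported: bool                     # did retrieval verify this claim?
    matched_aspects: set = field(default_factory=set)  # aspects it aligns with

def icat_style_score(claims, expected_aspects):
    """Combine factual precision and aspect coverage into one F1-style score.

    - factuality: fraction of extracted claims supported by the knowledge source
    - coverage:   fraction of expected aspects matched by >= 1 supported claim
    """
    if not claims or not expected_aspects:
        return 0.0
    factuality = sum(c.supported for c in claims) / len(claims)
    covered = set().union(
        *(c.matched_aspects for c in claims if c.supported)
    ) & set(expected_aspects)
    coverage = len(covered) / len(expected_aspects)
    if factuality + coverage == 0:
        return 0.0
    return 2 * factuality * coverage / (factuality + coverage)
```

For example, two supported claims covering two of three expected aspects plus one unsupported claim yields factuality 2/3, coverage 2/3, and a combined score of 2/3. In the actual framework, claim extraction and aspect alignment would be performed by the rule-based, embedding-based, or LLM-based components rather than supplied by hand.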