OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

📅 2024-08-06
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 31
Influential: 1
🤖 AI Summary
To address the pervasive hallucination problem in large language model (LLM) outputs and the lack of standardized protocols for open-domain factuality evaluation, this paper introduces OpenFactCheck, presented as the first open-source, end-to-end unified framework for factuality evaluation of LLMs. OpenFactCheck comprises three collaborating modules (RESPONSEEVAL, LLMEVAL, and CHECKEREVAL) that assess factual correctness at the response, model, and fact-checker levels, respectively, unifying interfaces, metrics, and benchmarking procedures. Implemented in Python, the framework integrates claim extraction, evidence retrieval, and claim verification into customizable verification pipelines, and is publicly available as a PyPI package, on GitHub, and as a web service. Evaluation across multiple benchmarks demonstrates improvements in assessment consistency and reproducibility, and the framework has seen adoption in the LLM safety and trustworthiness research community.
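The three-stage verification pipeline mentioned above (claim extraction, then evidence retrieval, then claim verification) can be sketched as a toy implementation. Note this is not the OpenFactCheck API: every function name and the simple string-matching heuristics here are illustrative stand-ins for the LLM-based components the paper describes.

```python
# Toy sketch of a claim-level fact-checking pipeline:
# extract claims -> retrieve evidence -> verify each claim.
# All names and heuristics are illustrative, not the OpenFactCheck API.

from dataclasses import dataclass


@dataclass
class Claim:
    text: str


def extract_claims(response: str) -> list[Claim]:
    # Naive sentence split, standing in for an LLM-based claim extractor.
    return [Claim(s.strip()) for s in response.split(".") if s.strip()]


def retrieve_evidence(claim: Claim, corpus: dict[str, str]) -> list[str]:
    # Toy retrieval: return passages sharing at least one word with the claim.
    words = set(claim.text.lower().split())
    return [p for p in corpus.values() if words & set(p.lower().split())]


def verify_claim(claim: Claim, evidence: list[str]) -> bool:
    # Toy verifier: a claim is "supported" if some passage contains it verbatim.
    return any(claim.text.lower() in e.lower() for e in evidence)


def factuality_score(response: str, corpus: dict[str, str]) -> float:
    # Fraction of extracted claims that are supported by retrieved evidence.
    claims = extract_claims(response)
    if not claims:
        return 1.0
    supported = sum(
        verify_claim(c, retrieve_evidence(c, corpus)) for c in claims
    )
    return supported / len(claims)
```

In a real system each stage would be a swappable component (the customization RESPONSEEVAL is said to support), e.g. an LLM prompt for extraction, a search engine for retrieval, and an entailment model for verification.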

📝 Abstract
The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
Problem

Research questions and friction points this paper is trying to address.

Assessing factual accuracy of free-form LLM outputs
Unifying diverse factuality evaluation benchmarks and measures
Providing customizable tools for automatic fact-checking system assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for LLM factuality evaluation
Customizable fact-checking system for input documents
Evaluates both LLMs and fact-checking systems