GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

πŸ“… 2024-02-19
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 6
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) frequently generate factual inaccuracies, even in high-stakes domains such as healthcare and finance, undermining trust in deployed applications. To address this, the paper proposes GenAudit, a document-grounded fact-checking tool that combines fine-grained, claim-level factual auditing with interactive evidence presentation. Methodologically, GenAudit trains specialized models that suggest edits to an LLM response, revising or removing claims unsupported by the reference document, and surfaces traceable evidence for claims that do have support, all presented through a visual, interactive interface usable across models and domains. The contributions are threefold: (1) an end-to-end system for automatic claim identification, revision, or deletion of unsupported statements, coupled with evidence links for verifiable claims; (2) high-precision error detection on summaries produced by eight different LLMs over documents from diverse domains; and (3) a user study demonstrating over 40% improvement in human error-detection accuracy. The code, models, and datasets are publicly released.

πŸ“ Abstract
LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. User studies demonstrate that using GenAudit can substantially improve the performance of humans at finding errors in LLM-generated summaries. We release our tool (GenAudit) and fact-checking model for public use.
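The auditing loop the abstract describes, splitting a response into claims, checking each against the reference document, and either attaching evidence or flagging the claim for removal, can be sketched as follows. This is a minimal illustration, not the paper's method: the trained claim-classification and editing models are replaced by a hypothetical token-overlap heuristic, and the function and variable names are invented for this example.

```python
# Sketch of claim-level fact auditing in the style GenAudit describes.
# NOTE: best_evidence() uses a crude token-overlap score as a stand-in
# for the paper's trained fact-checking models.

def split_claims(summary):
    """Split a summary into sentence-level claims (naive period split)."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def best_evidence(claim, document, threshold=0.5):
    """Return the document sentence with the highest token overlap with
    the claim, or None if no sentence clears `threshold` support."""
    claim_tokens = set(claim.lower().split())
    best, best_score = None, 0.0
    for sent in [s.strip() for s in document.split(".") if s.strip()]:
        overlap = len(claim_tokens & set(sent.lower().split()))
        score = overlap / max(len(claim_tokens), 1)
        if score > best_score:
            best, best_score = sent, score
    return best if best_score >= threshold else None

def audit(summary, document):
    """Label each claim 'keep' (with evidence) or 'remove' (unsupported)."""
    report = []
    for claim in split_claims(summary):
        evidence = best_evidence(claim, document)
        action = "keep" if evidence else "remove"
        report.append({"claim": claim, "action": action, "evidence": evidence})
    return report

doc = "The company reported revenue of 5 million dollars. Profits fell slightly."
summ = "The company reported revenue of 5 million dollars. The CEO resigned."
for row in audit(summ, doc):
    print(row["action"], "|", row["claim"])
```

Run on the toy pair above, the first (grounded) claim is kept with its supporting sentence attached, while the fabricated "CEO resigned" claim finds no evidence and is flagged for removal, mirroring the revise-or-remove behavior the abstract describes.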
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Accuracy Improvement
Critical Domain Applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

GenAudit
Error Correction
Transparency in AI
πŸ”Ž Similar Papers
No similar papers found.