🤖 AI Summary
This work addresses the factual inaccuracies prevalent in large language model (LLM) outputs by proposing FenCE, a claim-level, interpretable framework for factuality evaluation and optimization. Methodologically, it (1) trains a fine-grained, claim-level factuality evaluator that grounds its judgments in source documents retrieved by multiple tools and produces textual critiques alongside scores, and (2) designs an evaluator-driven revision and preference-training paradigm that jointly covers factual judgment, critique generation, and response refinement while avoiding the introduction of lesser-known facts. Experiments show the data augmentation improves evaluator accuracy by 2.9% on LLM-AggreFact; after optimization, Llama2-7B-chat and Llama3-8B-chat gain 16.86% and 14.45% in factuality rate on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83% and 6.96%. The core contribution is an interpretable, document-grounded, claim-level factuality evaluation mechanism that closes the evaluation–revision–preference-optimization loop.
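To make the claim-level setup concrete, here is a minimal sketch of what a grounded, per-claim judgment could look like. The field names (`claim`, `evidence`, `critique`, `label`) and the scoring rule are illustrative assumptions for this summary, not FenCE's actual output schema.

```python
# Hypothetical sketch of a claim-level, document-grounded judgment.
# Field names and the scoring rule are illustrative, not FenCE's schema.
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimJudgment:
    claim: str           # a single factual claim extracted from the response
    evidence: List[str]  # source documents retrieved by external tools
    critique: str        # textual explanation accompanying the judgment
    label: str           # e.g. "supported" or "unsupported"


@dataclass
class ResponseJudgment:
    response: str
    claims: List[ClaimJudgment]

    @property
    def factuality_score(self) -> float:
        """Fraction of claims judged supported (one plausible scoring rule)."""
        if not self.claims:
            return 1.0
        supported = sum(c.label == "supported" for c in self.claims)
        return supported / len(self.claims)
```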
📝 Abstract
Factuality evaluation aims to detect factual errors produced by language models (LMs) and thereby guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama2-7B-chat and Llama3-8B-chat's factuality rate by 16.86% and 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83% and 6.96%.
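As a rough illustration of the data-construction loop the abstract describes (generate candidates, revise and score each with the evaluator, prefer highly scored revisions), the following Python sketch assumes hypothetical `generate_candidates`, `fence_revise`, and `fence_score` interfaces; the paper's actual revision and scoring details may differ.

```python
# Minimal sketch of evaluator-driven preference-data construction.
# `generate_candidates`, `fence_revise`, and `fence_score` are hypothetical
# stand-ins for the generator and evaluator calls, not FenCE's real API.
from typing import Callable, List, Tuple


def build_preference_pairs(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    fence_revise: Callable[[str, str], str],   # (prompt, response) -> revised response
    fence_score: Callable[[str, str], float],  # (prompt, response) -> factuality score
    num_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for preference training."""
    candidates = generate_candidates(prompt, num_candidates)
    scored = []
    for response in candidates:
        # Revise with the evaluator's feedback; per the abstract, revision is
        # constrained so that no lesser-known facts are introduced.
        revised = fence_revise(prompt, response)
        scored.append((revised, fence_score(prompt, revised)))
    scored.sort(key=lambda item: item[1], reverse=True)

    # Prefer the highest-scoring revised response over the lowest-scoring one.
    pairs = []
    if len(scored) >= 2 and scored[0][1] > scored[-1][1]:
        pairs.append((prompt, scored[0][0], scored[-1][0]))
    return pairs
```

The resulting (prompt, chosen, rejected) triples can then feed a standard preference-optimization trainer; pairing best against worst is one simple choice, and other pairing schemes over the scored candidates are equally plausible under this sketch.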