Localizing and Mitigating Errors in Long-form Question Answering

📅 2024-07-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Long-form question answering (LFQA) suffers from hallucination and factual inconsistency because its answers are verbose and open-ended, which complicates faithfulness evaluation. To address this, the authors introduce HaluQuestQA, the first fine-grained, span-level hallucination-annotated dataset for LFQA, covering both human-written and model-generated answers. They further propose a unified feedback model that jointly performs error localization and explanation generation, and design an Error-informed Refinement prompting framework that integrates supervised fine-tuning with iterative refinement. Experiments demonstrate substantial reductions in hallucination rates across multiple LLMs, and human evaluators prefer the refined answers 84% of the time. This work establishes the first benchmark for fine-grained hallucination annotation and interpretable correction in LFQA, revealing systematic deficiencies in the answer completeness and reference utility of current LFQA systems.

๐Ÿ“ Abstract
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces errors and improves answer quality across multiple models. Furthermore, humans find answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers.
Problem

Research questions and friction points this paper addresses.

Detect and localize hallucinations in long-form answers
Evaluate factual inconsistencies in detailed responses
Improve answer quality via error-informed refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces HaluQuestQA dataset with localized error annotations
Trains automatic feedback model to predict error spans
Proposes Error-informed refinement to improve answer quality
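The refinement contribution above is an iterative loop: a learned feedback model localizes error spans in a draft answer and explains them, and those signals are fed back to the generator until no further errors are found. A minimal sketch of that control flow, using hypothetical stub functions in place of the paper's actual feedback model and LLM prompts:

```python
# Sketch of an error-informed refinement loop. The functions below are
# illustrative stand-ins, NOT the paper's implementation: feedback_model
# would be the trained span-level feedback model, and refine would prompt
# an LLM with the localized errors and explanations.

def feedback_model(answer):
    """Stub feedback model: returns (error_spans, explanations).
    Here it flags a hard-coded unsupported claim for illustration."""
    spans = [s for s in ["the moon is made of cheese"] if s in answer]
    explanations = ["Unsupported factual claim." for _ in spans]
    return spans, explanations

def refine(answer, spans, explanations):
    """Stub refiner: simply deletes the flagged spans; the real system
    would regenerate the answer conditioned on the error feedback."""
    for span in spans:
        answer = answer.replace(span, "").strip()
    return answer

def error_informed_refinement(answer, max_iters=3):
    """Iterate: localize errors, refine, stop when none remain."""
    for _ in range(max_iters):
        spans, explanations = feedback_model(answer)
        if not spans:  # no localized errors left
            break
        answer = refine(answer, spans, explanations)
    return answer

draft = "Tides are driven by lunar gravity. the moon is made of cheese"
print(error_informed_refinement(draft))
```

The key design point the sketch captures is that feedback is span-level rather than a single scalar score, so the refiner knows exactly which parts of the answer to revise.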