🤖 AI Summary
Existing LLM-based QA systems suffer from unfaithful question decomposition, incomplete evidence retrieval, and non-attributable answers. To address these issues, this paper proposes FIDES: a framework that first performs context-augmented two-stage decomposition to faithfully split long answers into verifiable sub-facts; then employs a retriever-driven evidence retrieval module coupled with conflict-aware dynamic sub-fact refinement; and finally aggregates multi-source evidence based on original syntactic structures. We introduce Attr_auto-P, an automated attribution evaluation metric, and validate FIDES across six benchmark datasets. Experiments demonstrate that FIDES achieves over 14% average improvement in attribution accuracy over state-of-the-art methods on mainstream models—including GPT-3.5-turbo, Gemini, and Llama-70B—significantly enhancing answer interpretability and factual trustworthiness.
📝 Abstract
Attribution is crucial in question answering (QA) with Large Language Models (LLMs). State-of-the-art (SOTA) question decomposition-based approaches use long-form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts during retrieval. These approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (*faithful context enhanced fact decomposition and evidence aggregation*) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long-form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, those sub-facts are revised accordingly. Finally, the evidence snippets are aggregated according to the original sentences. Extensive evaluation has been conducted on six datasets, with an additionally proposed metric called $Attr_{auto-P}$ for evaluating evidence precision. FIDES outperforms SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini, and the Llama 70B series.
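The decompose–retrieve–revise–aggregate loop described above can be sketched as a minimal pipeline. This is a hypothetical illustration, not the paper's implementation: the decomposer, retriever, and conflict check below are toy stand-ins (sentence splitting, word-overlap ranking, a naive negation flag), whereas FIDES uses LLM-based decomposition and a learned retriever.

```python
def decompose(answer: str) -> list[str]:
    """Toy stand-in for two-stage faithful decomposition:
    split a long-form answer into sentence-level sub-facts."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def retrieve(sub_fact: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank corpus snippets by word overlap with the sub-fact."""
    fact_words = set(sub_fact.lower().split())
    return sorted(
        corpus,
        key=lambda snippet: len(fact_words & set(snippet.lower().split())),
        reverse=True,
    )[:k]

def conflicts(sub_fact: str, evidence: str) -> bool:
    """Naive conflict check: flag evidence that negates an unnegated claim."""
    return "not " in evidence.lower() and "not " not in sub_fact.lower()

def fides_pipeline(answer: str, corpus: list[str]) -> list[tuple[str, str]]:
    """Decompose the answer, retrieve evidence per sub-fact, revise
    conflicting sub-facts, and pair each sub-fact with its evidence."""
    attributed = []
    for fact in decompose(answer):
        evidence = retrieve(fact, corpus)[0]
        if conflicts(fact, evidence):
            fact = evidence  # revise the sub-fact to agree with the evidence
        attributed.append((fact, evidence))
    return attributed
```

For example, `fides_pipeline("Paris is the capital of France. The Seine flows through Paris.", corpus)` yields one (sub-fact, evidence) pair per sentence; the final aggregation step in the paper would then group these pairs back by their original sentences.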