Decomposing and Revising What Language Models Generate

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based QA systems suffer from unfaithful question decomposition, incomplete evidence retrieval, and non-attributable answers. To address these issues, this paper proposes FIDES: a framework that first performs context-augmented two-stage decomposition to faithfully split long answers into verifiable sub-facts; then employs a retriever-driven evidence retrieval module coupled with conflict-aware dynamic sub-fact refinement; and finally aggregates multi-source evidence based on original syntactic structures. We introduce Attr_auto-P, an automated attribution evaluation metric, and validate FIDES across six benchmark datasets. Experiments demonstrate that FIDES achieves over 14% average improvement in attribution accuracy over state-of-the-art methods on mainstream models—including GPT-3.5-turbo, Gemini, and Llama-70B—significantly enhancing answer interpretability and factual trustworthiness.

📝 Abstract
Attribution is crucial in question answering (QA) with Large Language Models (LLMs). SOTA question decomposition-based approaches use long-form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts in retrieval. These approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (faithful context enhanced fact decomposition and evidence aggregation) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long-form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, such sub-facts are revised accordingly. Finally, the evidence snippets are aggregated according to the original sentences. Extensive evaluation has been conducted on six datasets, with an additionally proposed metric, $Attr_{auto-P}$, for evaluating evidence precision. FIDES outperforms the SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini, and the Llama 70B series.
Problem

Research questions and friction points this paper is trying to address.

Improving question relevance in decomposition-based QA retrieval
Addressing incomplete fact decomposition in evidence aggregation
Resolving conflicts between retrieved evidence and generated sub-facts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage faithful decomposition for sub-facts
Evidence retrieval with conflict-based revision
Aggregation by original sentence structure
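The three innovations above form a pipeline: decompose the long-form answer into sub-facts, retrieve evidence per sub-fact, revise sub-facts that conflict with their evidence, and aggregate evidence back by original sentence. A minimal sketch of that control flow is below; it is hypothetical, not the authors' implementation. The function names are placeholders, and the LLM-based decomposition, dense retrieval, and conflict judgment from the paper are replaced with trivial stand-ins (sentence splitting, lexical overlap, and a negation check).

```python
from dataclasses import dataclass, field

@dataclass
class SubFact:
    sentence_id: int          # index of the source sentence in the long answer
    text: str
    evidence: list = field(default_factory=list)

def decompose(long_answer):
    """Stand-in for the two-stage faithful decomposition: one sub-fact per
    sentence. The paper's method is LLM-driven and context-enhanced."""
    return [SubFact(i, s.strip())
            for i, s in enumerate(long_answer.split(".")) if s.strip()]

def retrieve(sub_fact, corpus):
    """Toy lexical retriever: keep snippets sharing at least two words
    with the sub-fact (a real system would use a dense retriever)."""
    words = set(sub_fact.text.lower().split())
    return [doc for doc in corpus
            if len(words & set(doc.lower().split())) >= 2]

def conflicts(sub_fact, snippet):
    """Stand-in conflict check; the paper uses an LLM judgment."""
    return "not " + sub_fact.text.lower() in snippet.lower()

def fides(long_answer, corpus):
    sub_facts = decompose(long_answer)
    for sf in sub_facts:
        sf.evidence = retrieve(sf, corpus)
        if any(conflicts(sf, e) for e in sf.evidence):
            sf.text = "[REVISED] " + sf.text  # placeholder for LLM revision
    # Aggregate evidence snippets back per original sentence.
    by_sentence = {}
    for sf in sub_facts:
        by_sentence.setdefault(sf.sentence_id, []).extend(sf.evidence)
    return sub_facts, by_sentence
```

Even this toy version preserves the property the paper targets: every retained sentence of the answer ends up paired with the evidence snippets that support (or forced a revision of) its sub-facts.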
Zhichao Yan
School of Computer and Information Technology, Shanxi University, Taiyuan, China
Jiaoyan Chen
Department of Computer Science, University of Manchester
Knowledge Graph · Ontology · Machine Learning · Large Language Model
Jiapu Wang
Beijing University of Technology, Beijing, China
Xiaoli Li
Singapore University of Technology and Design, Singapore
Ru Li
Harbin Institute of Technology
Jeff Z. Pan
Professor of Knowledge Computing, University of Edinburgh
Artificial Intelligence · Knowledge Representation and Reasoning · Knowledge Based Learning