🤖 AI Summary
Existing LLM-based QA systems suffer from unfaithful question decomposition, incomplete evidence retrieval, and non-attributable answers. To address these issues, this paper proposes FIDES: a framework that first performs context-augmented two-stage decomposition to faithfully split long answers into verifiable sub-facts; then employs a retriever-driven evidence retrieval module coupled with conflict-aware dynamic sub-fact refinement; and finally aggregates multi-source evidence based on original syntactic structures. We introduce Attr_auto-P, an automated attribution evaluation metric, and validate FIDES across six benchmark datasets. Experiments demonstrate that FIDES achieves over 14% average improvement in attribution accuracy over state-of-the-art methods on mainstream models—including GPT-3.5-turbo, Gemini, and Llama-70B—significantly enhancing answer interpretability and factual trustworthiness.
📝 Abstract
Attribution is crucial in question answering (QA) with Large Language Models (LLMs). State-of-the-art (SOTA) question decomposition-based approaches use long-form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts during retrieval. These approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (*faithful context enhanced fact decomposition and evidence aggregation*) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long-form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, those sub-facts are revised accordingly. Finally, the evidence snippets are aggregated according to the original sentences. Extensive evaluation has been conducted on six datasets, with an additionally proposed metric called $Attr_{auto-P}$ for evaluating evidence precision. FIDES outperforms SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini, and the Llama 70B series.
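The decompose–retrieve–revise–aggregate loop described above can be sketched as a minimal pipeline. This is a hypothetical illustration, not the paper's implementation: the decomposer, retriever, and conflict check below are toy stand-ins (sentence splitting, word-overlap ranking, a naive negation flag), whereas FIDES uses LLM-based decomposition and a learned retriever.

```python
def decompose(answer: str) -> list[str]:
    """Toy stand-in for two-stage faithful decomposition:
    split a long-form answer into sentence-level sub-facts."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def retrieve(sub_fact: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank corpus snippets by word overlap with the sub-fact."""
    fact_words = set(sub_fact.lower().split())
    return sorted(
        corpus,
        key=lambda snippet: len(fact_words & set(snippet.lower().split())),
        reverse=True,
    )[:k]

def conflicts(sub_fact: str, evidence: str) -> bool:
    """Naive conflict check: flag evidence that negates an unnegated claim."""
    return "not " in evidence.lower() and "not " not in sub_fact.lower()

def fides_pipeline(answer: str, corpus: list[str]) -> list[tuple[str, str]]:
    """Decompose the answer, retrieve evidence per sub-fact, revise
    conflicting sub-facts, and pair each sub-fact with its evidence."""
    attributed = []
    for fact in decompose(answer):
        evidence = retrieve(fact, corpus)[0]
        if conflicts(fact, evidence):
            fact = evidence  # revise the sub-fact to agree with the evidence
        attributed.append((fact, evidence))
    return attributed
```

For example, `fides_pipeline("Paris is the capital of France. The Seine flows through Paris.", corpus)` yields one (sub-fact, evidence) pair per sentence; the final aggregation step in the paper would then group these pairs back by their original sentences.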