🤖 AI Summary
Large language models (LLMs) often exhibit hallucination and low contextual faithfulness in long-form question answering. To address this, we propose GenDiE, a sentence-level self-evolution framework. GenDiE introduces a fine-grained optimization paradigm that treats each sentence of a response as an independent optimization unit and establishes a closed-loop "generate–discriminate–evolve" pipeline. It jointly trains generative and discriminative capabilities in a single model via instruction tuning and contrastive learning, enabling the model to autonomously construct its own aligned training data. At inference, it applies faithfulness-score-guided beam search over sentence candidates. Evaluated on the ASQA and ConFiQA benchmarks, GenDiE significantly improves response faithfulness and answer correctness while demonstrating strong cross-domain generalization, pointing toward more trustworthy retrieval-augmented generation systems.
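The closed-loop data construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the fixed candidate pool, and the word-overlap scorer are all hypothetical stand-ins for the model's self-generation and self-scoring capabilities. At each sentence step, the best- and worst-scored candidates form a (prefix, chosen, rejected) pair for contrastive training, and the chosen sentence extends the prefix.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    generate_candidates: Callable[[str, List[str]], List[str]],
    score_sentence: Callable[[str, List[str], str], float],
    context: str,
    num_steps: int = 2,
) -> List[Tuple[List[str], str, str]]:
    """One self-evolution round (illustrative sketch): at each sentence
    step the model proposes candidates, scores them itself, and the
    best/worst candidates become a (prefix, chosen, rejected) pair for
    contrastive training; the chosen sentence extends the prefix."""
    prefix: List[str] = []
    pairs: List[Tuple[List[str], str, str]] = []
    for _ in range(num_steps):
        cands = generate_candidates(context, prefix)
        if not cands:
            break
        ranked = sorted(
            cands,
            key=lambda c: score_sentence(context, prefix, c),
            reverse=True,
        )
        if len(ranked) >= 2 and ranked[0] != ranked[-1]:
            pairs.append((list(prefix), ranked[0], ranked[-1]))
        prefix.append(ranked[0])  # continue from the most faithful sentence
    return pairs

# Toy stand-ins (hypothetical): a fixed candidate pool in place of the
# generator, and word overlap with the retrieved context as the scorer.
context = "The Eiffel Tower is in Paris. It was completed in 1889."

def toy_generate(ctx: str, prefix: List[str]) -> List[str]:
    return ["The tower stands in Paris.", "The tower stands in London."]

def toy_score(ctx: str, prefix: List[str], cand: str) -> float:
    ctx_words = set(ctx.lower().replace(".", " ").split())
    words = cand.lower().replace(".", " ").split()
    return sum(w in ctx_words for w in words) / len(words)

pairs = build_preference_pairs(toy_generate, toy_score, context)
```

In this toy run, the context-consistent sentence outranks the contradicting one at every step, so each preference pair contrasts a faithful "chosen" sentence against an unfaithful "rejected" one under the same prefix.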
📝 Abstract
Improving context faithfulness in large language models is essential for developing trustworthy retrieval-augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene in LLMs only at inference time, without addressing their inherent limitations, or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE addresses a limitation of previous approaches that optimize at the holistic answer level and may therefore miss unfaithful details. Experiments on the ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance under domain adaptation.
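The score-guided search during inference can be sketched as a sentence-level beam search. Again, this is an illustrative sketch under assumed interfaces, not the paper's code: the generator and the word-overlap faithfulness scorer are toy stand-ins for the model's self-generation and self-scoring, and every name here is hypothetical.

```python
from typing import Callable, List, Tuple

def sentence_beam_search(
    generate_candidates: Callable[[str, List[str]], List[str]],
    score_sentence: Callable[[str, List[str], str], float],
    context: str,
    beam_width: int = 2,
    max_sentences: int = 2,
) -> List[str]:
    """Sentence-level score-guided beam search (illustrative sketch).

    Each hypothesis is a list of sentences. At every step the generator
    proposes candidate next sentences, the self-scorer rates each
    candidate's faithfulness to `context`, and only the `beam_width`
    hypotheses with the highest cumulative score survive."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_sentences):
        expanded: List[Tuple[float, List[str]]] = []
        for cum_score, sents in beams:
            for cand in generate_candidates(context, sents):
                s = score_sentence(context, sents, cand)
                expanded.append((cum_score + s, sents + [cand]))
        if not expanded:
            break
        expanded.sort(key=lambda x: x[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]  # sentences of the highest-scoring hypothesis

# Toy stand-ins (hypothetical): a fixed candidate pool and word overlap
# with the retrieved context as the faithfulness score.
context = "The Eiffel Tower is in Paris. It was completed in 1889."

def toy_generate(ctx: str, prefix: List[str]) -> List[str]:
    return ["It is located in Paris.", "It is located in London."]

def toy_score(ctx: str, prefix: List[str], cand: str) -> float:
    ctx_words = set(ctx.lower().replace(".", " ").split())
    words = cand.lower().replace(".", " ").split()
    return sum(w in ctx_words for w in words) / len(words)

answer = sentence_beam_search(toy_generate, toy_score, context)
```

Because pruning happens per sentence rather than once per full answer, a single unfaithful sentence lowers a hypothesis's cumulative score immediately, so the search steers away from it before the rest of the answer is committed.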