🤖 AI Summary
This work addresses critical challenges in multi-hop question answering with large language models (LLMs): low answer accuracy, poor answer faithfulness, and weak robustness to noisy or conflicting external knowledge. Inspired by the judicial Chain of Evidence (CoE) paradigm—where evidence is rigorously evaluated for logical coherence and mutual support—we introduce, for the first time, a CoE-inspired framework for LLM-based knowledge assessment. Our CoE-aware reasoning framework jointly models (i) relevance between retrieved knowledge and the query, and (ii) multi-hop logical consistency among knowledge snippets, and integrates seamlessly into retrieval-augmented generation (RAG) pipelines. Evaluated across five mainstream LLMs and three realistic RAG settings, our method consistently improves answer accuracy, faithfulness, and robustness against knowledge noise and contradictions. This establishes a novel, principled paradigm for trustworthy knowledge-enhanced reasoning.
📝 Abstract
Incorporating external knowledge into large language models (LLMs) has emerged as a promising approach to mitigating outdated knowledge and hallucination in LLMs. However, external knowledge is often imperfect: alongside useful knowledge, the context is rife with irrelevant information or misinformation that can impair the reliability of LLM responses. This paper studies the kind of external knowledge LLMs prefer in such imperfect contexts when handling multi-hop QA. Inspired by the Chain of Evidence (CoE) in criminal procedural law, we characterize the knowledge preferred by LLMs as maintaining both relevance to the question and mutual support among knowledge pieces. Accordingly, we propose an automated CoE discrimination approach and evaluate LLMs' effectiveness, faithfulness, and robustness with CoE, including its application in Retrieval-Augmented Generation (RAG). Tests on five LLMs show that CoE improves generation accuracy, answer faithfulness, and robustness to knowledge conflicts, and boosts the performance of existing approaches in three practical RAG scenarios.
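To make the CoE intuition concrete, the sketch below illustrates the two criteria the abstract names: seed a chain with snippets relevant to the question, then grow it with snippets that mutually support an accepted one (multi-hop bridging). This is a toy illustration, not the paper's actual discrimination algorithm: the function names (`build_coe`, `overlap`), the thresholds, and the use of Jaccard word overlap as a stand-in for a learned relevance/support model are all assumptions for demonstration only.

```python
import re

def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def overlap(a, b):
    """Jaccard word overlap -- a crude stand-in for a learned relevance model."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_coe(question, snippets, rel_thresh=0.2, sup_thresh=0.1):
    """Seed the chain with question-relevant snippets, then iteratively add
    snippets that mutually support an accepted one (multi-hop bridging)."""
    chain = [s for s in snippets if overlap(question, s) >= rel_thresh]
    grew = True
    while grew:
        grew = False
        for s in snippets:
            if s not in chain and any(overlap(s, c) >= sup_thresh for c in chain):
                chain.append(s)
                grew = True
    return chain

question = "Who directed the film that won Best Picture in 1995?"
snippets = [
    "Forrest Gump won the Academy Award for Best Picture in 1995.",
    "Forrest Gump was directed by Robert Zemeckis.",   # bridges via "Forrest Gump"
    "The Houston Rockets are an NBA team.",            # noise: no relevance, no support
]
print(build_coe(question, snippets))
```

Note how the second snippet is weakly relevant to the question on its own but joins the chain through mutual support with the first, while the noise snippet is excluded on both counts.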