From Facts to Conclusions: Integrating Deductive Reasoning in Retrieval-Augmented LLMs

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Retrieval-Augmented Generation (RAG) systems often produce erroneous or unreliable answers when confronted with conflicting, outdated, or subjective external evidence. Method: We propose a reasoning-trace-enhanced RAG framework comprising three stages—document adjudication, conflict analysis, and grounded synthesis—to enable explainable, verifiable answer generation or principled refusal. Our approach integrates reasoning-trace modeling, LLM-as-a-Judge evaluation, supervised fine-tuning (SFT), structured evidence linking, and an explicit refusal mechanism. Contribution/Results: We introduce the first Conflict-Aware Trustworthiness Scoring (CATS) pipeline; construct ConflictQA, a 539-query reasoning dataset tailored to conflict scenarios; and propose a unified behavioral supervision paradigm. Evaluated on Qwen, our method achieves end-to-end answer accuracy of 0.883 (up from 0.069) and behavioral compliance of 0.722 (up from 0.074), substantially outperforming existing RAG baselines.

📝 Abstract
Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. We introduce a Conflict-Aware Trust-Score (CATS) pipeline that evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where supervised fine-tuning improved end-to-end answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
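The three-stage flow described in the abstract can be sketched as a minimal, hypothetical Python pipeline. All function names, document fields, and decision rules below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Verdict:
    answer: Optional[str]          # citation-linked answer, or None on refusal
    refusal_reason: Optional[str]  # populated only when the system refuses
    trace: List[str]               # human-readable reasoning trace

def adjudicate(docs):
    """Stage 1 (illustrative): judge each retrieved document's trustworthiness."""
    kept, trace = [], []
    for d in docs:
        ok = not d.get("outdated", False)
        trace.append(f"doc {d['id']}: {'kept' if ok else 'rejected as outdated'}")
        if ok:
            kept.append(d)
    return kept, trace

def analyze_conflicts(docs):
    """Stage 2 (illustrative): flag contradictory claims across surviving docs."""
    claims = sorted({d["claim"] for d in docs})
    return len(claims) > 1, [f"distinct claims found: {claims}"]

def synthesize(docs):
    """Stage 3 (illustrative): emit a citation-linked answer from agreeing docs."""
    d = docs[0]
    return f"{d['claim']} [doc {d['id']}]"

def answer(docs) -> Verdict:
    kept, trace = adjudicate(docs)
    if not kept:
        return Verdict(None, "no trustworthy evidence retrieved", trace)
    conflict, conflict_trace = analyze_conflicts(kept)
    trace += conflict_trace
    if conflict:
        return Verdict(None, "unresolved conflict between sources", trace)
    return Verdict(synthesize(kept), None, trace)
```

On this sketch, agreeing documents yield a citation-linked answer, while contradictory or uniformly outdated documents trigger a justified refusal with the reasoning trace attached.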
Problem

Research questions and friction points this paper is trying to address.

Addresses conflicting or outdated information in retrieval-augmented LLMs
Integrates structured reasoning for document adjudication and synthesis
Evaluates groundedness and refusal accuracy with a conflict-aware pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a reasoning-trace-augmented RAG framework
Adds structured reasoning across three interpretable stages
Uses a Conflict-Aware Trust-Score evaluation pipeline
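The CATS pipeline scores four axes with an LLM-as-a-Judge. A minimal sketch of how per-axis judge scores might be combined into a single trust score; equal weighting is an assumption here, as the source does not state the aggregation formula:

```python
# The four evaluation axes named in the abstract.
CATS_AXES = (
    "groundedness",
    "factual_correctness",
    "refusal_accuracy",
    "conflict_behavior_alignment",
)

def cats_score(judgments: dict) -> float:
    """Average the four judge scores (each assumed in [0, 1]) into one trust
    score. Equal weighting is an illustrative assumption, not the paper's."""
    missing = [a for a in CATS_AXES if a not in judgments]
    if missing:
        raise ValueError(f"missing judge scores for: {missing}")
    return sum(judgments[a] for a in CATS_AXES) / len(CATS_AXES)
```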
👥 Authors
Shubham Mishra, PhD Student (Distributed Systems, Cryptography)
Samyek Jain (Birla Institute of Technology and Science, Pilani)
Gorang Mehrishi (Birla Institute of Technology and Science, Pilani)
Shiv Tiwari (Birla Institute of Technology and Science, Pilani)
Harsh Sharma (Carnegie Mellon University, Pittsburgh)
Pratik Narang (Birla Institute of Technology and Science, Pilani)
Dhruv Kumar (Birla Institute of Technology and Science, Pilani)