Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face challenges in diagnostic reasoning due to insufficient knowledge reliability and interpretability. To address this, we propose a novel reward modeling paradigm grounded in knowledge graph (KG) inference paths: an LLM is trained as a path correctness discriminator, emulating clinicians’ evaluation of diagnostic logic—marking the first systematic exploration of reward-driven structured clinical reasoning. Our method integrates biomedical KGs (e.g., UMLS), path-level reward training, model distillation, and multi-task formalization to generate fine-grained, interpretable supervision signals for reasoning. Experiments demonstrate significant improvements in LLM accuracy on path judgment tasks; however, gains do not fully transfer to downstream diagnostic generalization, revealing both the promise and inherent limitations of path-based reward modeling for enhancing clinical reasoning trustworthiness.

📝 Abstract
Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge-grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval-augmented generation or fine-tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model over KG reasoning paths, where the model learns to judge whether a candidate path leads to the correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and is grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians' diagnostic assessment, in which they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulations for knowledge path judging and eight training paradigms. Second, we test whether path-judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open-source instruction-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, transferability to downstream tasks remains weak. Our findings provide the first systematic assessment of "reward-model-style" reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.
Problem

Research questions and friction points this paper is trying to address.

Addressing unreliable knowledge-grounded inference in LLM-based diagnostic reasoning
Integrating structured biomedical knowledge from KGs for trustworthy clinical reasoning
Evaluating reward model approaches for judging diagnostic reasoning paths in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM as reward model for KG paths
Judges reasoning paths instead of generating them
Systematically evaluates path judging training paradigms
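The path-judging idea above can be sketched minimally. This is a hedged illustration only: `format_path`, `judge_path`, and `score_fn` are hypothetical names standing in for the paper's prompts, KG schema, and trained discriminator, none of which are specified here.

```python
# Hypothetical sketch of LLM-as-reward-model path judging.
# The scorer is abstracted as score_fn (probability in [0, 1]);
# a trained LLM discriminator would take its place.

def format_path(path):
    """Render a KG inference path given as an alternating
    [entity, relation, entity, ...] list, e.g.
    'fever --symptom_of--> influenza'."""
    out = [path[0]]
    for i in range(1, len(path) - 1, 2):
        out.append(f"--{path[i]}--> {path[i + 1]}")
    return " ".join(out)

def judge_path(patient_input, path, score_fn):
    """Binary path-correctness judgment: does the candidate path
    support the correct diagnosis for this patient?"""
    prompt = (
        f"Patient: {patient_input}\n"
        f"Candidate reasoning path: {format_path(path)}\n"
        "Does this path lead to the correct diagnosis?"
    )
    return score_fn(prompt) >= 0.5

# Usage with a stub scorer in place of a trained model:
path = ["fever", "symptom_of", "influenza"]
accept = judge_path("adult with acute fever and cough", path,
                    score_fn=lambda prompt: 0.9)
```

The discriminative framing is the point: the model only verifies a candidate path rather than generating one, mirroring the verification-is-easier-than-generation intuition the abstract cites.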