🤖 AI Summary
Clinical large language models lack rigorous clinical validation of their reasoning processes, which hinders trustworthy deployment in clinical decision support. To address this, we propose the Critical Evidence Graph (CEG), a verifiable, traceable framework for modeling structured reasoning paths that, for the first time, explicitly constrains medical reasoning to evidence-based nodes and causal chains. Methodologically, we design a three-dimensional reward function measuring node coverage, structural correctness, and chain completeness, and integrate it with clinical-knowledge-guided Proximal Policy Optimization (PPO) reinforcement learning and an algorithmic CEG construction pipeline. Our approach achieves significant improvements over state-of-the-art methods across multiple medical reasoning benchmarks, and the generated reasoning chains receive high clinical credibility scores from domain experts (mean 4.82/5.0). We publicly release our code, models, and a dataset of challenging clinical cases to foster reproducible research and clinical evaluation.
📝 Abstract
Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at https://github.com/LinjieMu/MedCEG.
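The Clinical Reasoning Procedure Reward described above combines three components: Node Coverage, Structural Correctness, and Chain Completeness. The paper does not give the exact formulas here, so the sketch below is an illustrative assumption, not the authors' implementation: it treats the reference CEG as sets of nodes, causal edges, and root-to-leaf chains, scores each component as a simple ratio, and combines them with hypothetical weights.

```python
# Illustrative sketch (NOT the MedCEG implementation): a composite procedure
# reward comparing a predicted reasoning graph against a reference CEG.
# Metric definitions and the weights w are assumptions for illustration.

def node_coverage(pred_nodes, ref_nodes):
    """Fraction of reference evidence nodes that appear in the prediction."""
    if not ref_nodes:
        return 1.0
    return len(set(pred_nodes) & set(ref_nodes)) / len(ref_nodes)

def structural_correctness(pred_edges, ref_edges):
    """Fraction of predicted causal edges that exist in the reference graph."""
    if not pred_edges:
        return 0.0
    return len(set(pred_edges) & set(ref_edges)) / len(pred_edges)

def chain_completeness(pred_edges, ref_chains):
    """Fraction of reference evidence chains reproduced edge-by-edge."""
    if not ref_chains:
        return 1.0
    pred = set(pred_edges)
    complete = sum(
        all((a, b) in pred for a, b in zip(chain, chain[1:]))
        for chain in ref_chains
    )
    return complete / len(ref_chains)

def procedure_reward(pred_nodes, pred_edges,
                     ref_nodes, ref_edges, ref_chains,
                     w=(0.4, 0.3, 0.3)):
    """Weighted sum of the three components (weights are illustrative)."""
    return (w[0] * node_coverage(pred_nodes, ref_nodes)
            + w[1] * structural_correctness(pred_edges, ref_edges)
            + w[2] * chain_completeness(pred_edges, ref_chains))

# Toy example: reference chain fever -> infection -> sepsis; the model
# recovers only the first causal step.
ref_nodes = ["fever", "infection", "sepsis"]
ref_edges = [("fever", "infection"), ("infection", "sepsis")]
ref_chains = [["fever", "infection", "sepsis"]]
pred_nodes = ["fever", "infection"]
pred_edges = [("fever", "infection")]
r = procedure_reward(pred_nodes, pred_edges, ref_nodes, ref_edges, ref_chains)
```

A scalar reward of this shape can be fed directly into a PPO loop as the episode return for each generated reasoning trace; the partial credit from node coverage and structural correctness gives the policy a learning signal even when no full chain is yet reproduced.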