AI Summary
Clinical practice guidelines (CPGs) are usually treated as free-text documents, a form that leaves their underlying decision logic implicit when they are used for language model training or retrieval. To address this, the study first converts CPG recommendations into executable, programmatic decision structures and then uses these structures to generate factual and counterfactual question-answer pairs, which serve as structured supervision for fine-tuning a medical large language model. The aim is for the model to internalize guideline-driven clinical reasoning rather than memorize surface-level text. Across four clinical reasoning benchmarks, the resulting model achieves a 10.28% relative improvement in average accuracy. Physician evaluations further indicate that the model's rationales are preferred over baseline outputs for faithfulness, validity, completeness, and clarity.
Abstract
Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.