Training and Evaluation of Guideline-Based Medical Reasoning in LLMs

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often fail to adhere to clinical consensus guidelines in early prediction tasks, undermining the credibility and interpretability of their reasoning, particularly in prospective forecasting, the core challenge of clinical early warning. Method: The authors propose a consensus-guideline-based reasoning framework that fine-tunes LLMs on finely annotated electronic health records and integrates them with time-series forecasting models in a multimodal architecture, enabling robust prospective inference over sparse, irregularly sampled clinical variables. Contribution/Results: The framework supports automated evaluation along two dimensions: correctness of the reasoning path (per clinical guidelines) and accuracy of numerical predictions. Experiments show that a small fine-tuned model achieves near-perfect compliance with the Sepsis-3 diagnostic criteria, including nuanced rule exceptions, significantly outperforming considerably larger models given one-shot prompts with the explicit definition. This advances trustworthy, guideline-conformant clinical decision support.
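The Sepsis-3 criterion the summary refers to can be stated compactly: sepsis is a suspected infection together with an acute rise in the SOFA score of at least 2 points over baseline. A minimal sketch of that rule as an executable check (field names and the `PatientState` type are illustrative, not the paper's schema):

```python
from dataclasses import dataclass

@dataclass
class PatientState:
    """Minimal snapshot of the variables the Sepsis-3 rule consults."""
    suspected_infection: bool
    sofa_baseline: int   # SOFA score before the suspected infection
    sofa_current: int    # most recent SOFA score

def meets_sepsis3(p: PatientState) -> bool:
    """Sepsis-3 consensus rule: suspected infection plus an acute
    increase in SOFA score of at least 2 points over baseline."""
    return p.suspected_infection and (p.sofa_current - p.sofa_baseline) >= 2

# Example: infection suspected, SOFA rose from 1 to 4 -> criterion met
print(meets_sepsis3(PatientState(True, 1, 4)))  # True
```

Exceptions to the consensus rule (which the paper also covers) would add further branches to this function; the sketch shows only the core criterion.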

📝 Abstract
Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
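The two evaluation dimensions named in the abstract can be sketched mechanically: derivation correctness re-applies the consensus rule to the premises the model itself stated and checks that its conclusion follows; value correctness compares a predicted clinical value to the real measurement. A hedged sketch, assuming a hypothetical parsed model output and an illustrative tolerance (neither is the paper's actual evaluation harness):

```python
def derivation_correct(premises: dict, stated_conclusion: bool) -> bool:
    """Derivation correctness: re-derive the conclusion from the premises
    the model stated and compare with the model's own conclusion.
    Illustrative rule: suspected infection AND SOFA increase >= 2."""
    implied = premises["suspected_infection"] and premises["sofa_increase"] >= 2
    return implied == stated_conclusion

def value_correct(predicted: float, measured: float, tol: float = 0.5) -> bool:
    """Value correctness: predicted value vs. real-world measurement,
    within an illustrative absolute tolerance."""
    return abs(predicted - measured) <= tol

# A derivation can be correct even when the value is wrong (and vice versa):
print(derivation_correct({"suspected_infection": True, "sofa_increase": 3}, True))
print(value_correct(predicted=2.0, measured=3.1))
```

Separating the two checks is what lets the paper attribute errors either to faulty rule application or to inaccurate forecasting of the underlying variables.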
Problem

Research questions and friction points this paper is trying to address.

Teaching LLMs to follow medical consensus guidelines for reasoning.
Automatically evaluating model inference correctness using consensus rules.
Improving early prediction by forecasting sparse clinical variables.
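The forecasting problem in the last point arises because clinical variables are sampled sparsely and at irregular times. A crude baseline for making such series usable (last observation carried forward onto a regular grid); the paper instead uses a learned time-series forecasting model, so this is only a reference point:

```python
def forward_fill(obs, grid):
    """Carry the last observation forward onto a regular query grid.
    obs:  list of (time, value) pairs sorted by time
    grid: regularly spaced query times"""
    out, i, last = [], 0, None
    for t in grid:
        # consume all observations up to and including time t
        while i < len(obs) and obs[i][0] <= t:
            last = obs[i][1]
            i += 1
        out.append(last)  # None before the first observation
    return out

# e.g. lactate measured only at t=1 and t=5, queried hourly
print(forward_fill([(1, 2.1), (5, 3.4)], [0, 1, 2, 3, 4, 5, 6]))
# [None, 2.1, 2.1, 2.1, 2.1, 3.4, 3.4]
```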
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune LLMs with verbalized medical consensus guidelines and exceptions.
Automatically evaluate model inference using derivation and value correctness.
Integrate time series forecasting with LLMs for clinical variable prediction.
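The first innovation, verbalized rule instantiation, means grounding the consensus rule in a concrete patient record to produce a fine-tuning example. A minimal sketch of such a generator; the template wording and field names are illustrative assumptions, not the paper's annotation format:

```python
def verbalize_instance(record: dict) -> dict:
    """Turn one EHR snapshot into a (prompt, target) pair that walks
    through the Sepsis-3 rule step by step."""
    delta = record["sofa_current"] - record["sofa_baseline"]
    sepsis = record["suspected_infection"] and delta >= 2
    prompt = (
        f"Suspected infection: {record['suspected_infection']}. "
        f"SOFA baseline: {record['sofa_baseline']}, "
        f"current: {record['sofa_current']}. "
        "Does the patient meet the Sepsis-3 criteria? Reason step by step."
    )
    target = (
        f"The SOFA score changed by {delta} points. An increase of at "
        f"least 2 together with suspected infection "
        f"{'is' if sepsis else 'is not'} present, so the Sepsis-3 "
        f"criteria are {'met' if sepsis else 'not met'}."
    )
    return {"prompt": prompt, "target": target}
```

Applying such a generator across many records (and across the rule's exceptions) yields the fine-tuning corpus; because the target spells out each premise, derivation correctness can later be checked automatically.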