A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines whether large language models (LLMs) can recognize and adhere to clinical practice guidelines (CPGs) in multi-turn conversations, a poorly understood capability that poses potential safety risks. To evaluate it systematically, the authors introduce CPGBench, a benchmark built from 3,418 CPG documents issued over the past decade by nine countries and two international organizations, from which 32,155 multi-turn dialogues are automatically generated. Eight leading LLMs are assessed using both automated metrics and manual review by 56 clinicians. The results reveal significant shortcomings: while guideline detection rates range from 71.1% to 89.6%, only 3.6% to 29.7% of responses correctly cite the source guideline, and adherence rates fall between 21.8% and 63.2%. These findings highlight a critical gap between current LLM capabilities and the requirements for safe clinical deployment.
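To make the three reported rates concrete, the sketch below shows one way per-dialogue judgments could be aggregated into detection, citation, and adherence percentages. This is a minimal illustration only; the class and field names are hypothetical, and the paper does not publish its scoring implementation.

```python
from dataclasses import dataclass

@dataclass
class DialogueJudgment:
    """Automated judgment for one generated multi-turn dialogue (hypothetical schema)."""
    detected: bool      # model surfaced the relevant recommendation
    cited_title: bool   # model named the correct source guideline
    adhered: bool       # model's advice followed the recommendation

def benchmark_rates(judgments: list[DialogueJudgment]) -> dict[str, float]:
    """Aggregate per-dialogue judgments into the three reported rates."""
    n = len(judgments)
    return {
        "detection_rate": sum(j.detected for j in judgments) / n,
        "citation_rate": sum(j.cited_title for j in judgments) / n,
        "adherence_rate": sum(j.adhered for j in judgments) / n,
    }

# Example: 3 dialogues -> detection 100%, citation ~33%, adherence ~67%
rates = benchmark_rates([
    DialogueJudgment(detected=True, cited_title=True, adhered=True),
    DialogueJudgment(detected=True, cited_title=False, adhered=True),
    DialogueJudgment(detected=True, cited_title=False, adhered=False),
])
print(rates)
```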

📝 Abstract
Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it remains unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents published in the last decade by 9 countries/regions and 2 international organizations, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations together with metadata such as publishing institution, date, country, specialty, recommendation strength, and evidence level. One multi-turn conversation is generated per recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations are correctly detected, while only 3.6%-29.7% of the corresponding guideline titles are correctly referenced, revealing a gap between knowing guideline contents and knowing where they come from. Adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark to systematically reveal which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.
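The abstract describes each extracted recommendation as a record carrying several metadata fields, with one multi-turn conversation generated per record. Below is a minimal sketch of that data shape; all field names and the dialogue-seeding helper are hypothetical, since the paper does not publish its schema or generation code.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """One extracted clinical recommendation with the metadata the abstract lists.
    Field names are illustrative, not taken from CPGBench itself."""
    text: str              # the recommendation statement
    guideline_title: str   # title of the source CPG document
    institution: str       # publishing institution
    date: str              # publication date (e.g., ISO format)
    country: str           # issuing country/region or international organization
    specialty: str         # one of the 24 specialties
    strength: str          # recommendation strength (e.g., "strong")
    evidence_level: str    # e.g., "A", "B", "C"

def seed_dialogue(rec: Recommendation) -> list[dict[str, str]]:
    """Seed one multi-turn dialogue from one recommendation.
    A real pipeline would expand these turns with an LLM."""
    return [
        {"role": "user", "content": f"I have a question related to {rec.specialty}."},
        {"role": "assistant", "content": "Could you tell me more about your situation?"},
        # ...further turns probing whether the model detects, cites,
        # and applies rec.text from rec.guideline_title
    ]
```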
Problem

Research questions and friction points this paper is trying to address.

Clinical Practice Guidelines
Large Language Models
Multi-turn Conversations
Guideline Adherence
Healthcare AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

CPGBench
clinical practice guidelines
large language models
multi-turn conversation
guideline adherence