H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current DDx evaluation relies on flat metrics (e.g., Top-k accuracy), which fail to distinguish clinically proximate errors from irrelevant ones and neglect the hierarchical relationships among diagnoses. To address this, the authors propose H-DDx, a hierarchical differential-diagnosis evaluation framework that maps free-text diagnoses to ICD-10 codes via a retrieval-reranking pipeline, enables semantic matching through a medical knowledge graph, and introduces an ICD-10-informed scoring mechanism that quantifies clinical proximity across anatomical, pathological, and etiological dimensions. Experiments across 22 mainstream LLMs show that H-DDx substantially improves the clinical plausibility and interpretability of evaluation: models often capture the correct clinical context even when the precise diagnosis is missed, and conventional metrics underestimate performance by an average of 18.7%. Domain-specialized open-source models, particularly Med-PaLM-M, achieve the highest scores under H-DDx.

📝 Abstract
An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.
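The hierarchical crediting idea can be sketched as a toy scorer over ICD-10 code prefixes. The paper's actual weights and hierarchy levels are not specified here; the three levels and the 1.0/0.7/0.3 credits below are illustrative assumptions only:

```python
def hierarchical_credit(pred: str, truth: str) -> float:
    """Score one predicted ICD-10 code against the ground truth.

    Toy scoring over three ICD-10 levels (exact code, 3-character
    category, chapter letter); the levels and weights are illustrative
    assumptions, not the paper's actual scheme.
    """
    pred = pred.upper().replace(".", "")
    truth = truth.upper().replace(".", "")
    if pred == truth:
        return 1.0   # exact diagnosis
    if pred[:3] == truth[:3]:
        return 0.7   # same 3-character category, e.g. J45 (asthma)
    if pred[:1] == truth[:1]:
        return 0.3   # same chapter letter (coarse proxy for the ICD chapter)
    return 0.0       # diagnostically distant


def hierarchical_top_k(preds: list[str], truth: str, k: int = 5) -> float:
    """Best credit among the top-k entries of a DDx list."""
    return max((hierarchical_credit(p, truth) for p in preds[:k]), default=0.0)
```

Under such a scheme, a prediction of J45.1 against a ground truth of J45.9 earns partial credit (same asthma category), whereas flat Top-k accuracy would count it as a plain miss.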
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' differential-diagnosis accuracy with hierarchy-aware rather than flat metrics
Flat metrics' inability to distinguish clinically relevant near-misses from diagnostically distant errors
Mapping free-text diagnoses to ICD-10 codes so clinical relevance can be assessed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical evaluation framework aligned with clinical relevance
Retrieval-and-reranking pipeline mapping free-text diagnoses to ICD-10 codes
Hierarchical metric that credits predictions close to the ground-truth diagnosis
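The two-stage mapping above can be sketched with lexical similarity standing in for the learned retriever and reranker. The tiny `ICD10_INDEX` slice and `difflib`-based scoring are illustrative assumptions, not the paper's actual components:

```python
from difflib import SequenceMatcher

# Tiny illustrative slice of ICD-10 (descriptions abbreviated); the real
# pipeline indexes the full code set with learned embeddings.
ICD10_INDEX = {
    "J45.9": "asthma unspecified",
    "J44.1": "chronic obstructive pulmonary disease with acute exacerbation",
    "I21.9": "acute myocardial infarction unspecified",
}


def lexical_score(query: str, desc: str) -> float:
    """Cheap string-overlap score standing in for embedding similarity."""
    return SequenceMatcher(None, query.lower(), desc.lower()).ratio()


def retrieve(free_text: str, k: int = 2) -> list[str]:
    """Stage 1: pull k candidate codes (stand-in for a dense retriever)."""
    ranked = sorted(ICD10_INDEX,
                    key=lambda c: lexical_score(free_text, ICD10_INDEX[c]),
                    reverse=True)
    return ranked[:k]


def rerank(free_text: str, candidates: list[str]) -> str:
    """Stage 2: pick the best candidate (stand-in for a cross-encoder reranker)."""
    return max(candidates, key=lambda c: lexical_score(free_text, ICD10_INDEX[c]))
```

Splitting retrieval from reranking is the standard design choice here: the first stage keeps the search over the full ICD-10 code set cheap, while the second stage spends a more expensive comparison only on the shortlisted candidates.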
👥 Authors
Seungseop Lim (AITRICS, KAIST)
Gibaeg Kim (AITRICS)
Hyunkyung Lee (AITRICS, KAIST)
Wooseok Han (AITRICS, KAIST)
Jean Seo (AITRICS, KAIST)
Jaehyo Yoo (AITRICS, KAIST)
Eunho Yang (KAIST)