DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models

📅 2024-08-04
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing large language models (LLMs) exhibit insufficient interpretability in clinical diagnostic reasoning, particularly failing to replicate the complete reasoning chain of a human physician: “observation → inference → diagnosis.” Method: We introduce DiReCT, a diagnostic reasoning benchmark comprising 511 physician-annotated clinical notes and an accompanying domain-specific medical knowledge graph. We propose an evaluation paradigm that jointly leverages the structured knowledge graph and unstructured clinical text to assess reasoning. Contribution/Results: DiReCT is the first benchmark to systematically quantify LLM performance along two interpretability dimensions: diagnostic pathway plausibility and conclusion accuracy. Experiments reveal that even state-of-the-art models (e.g., GPT-4), when augmented with domain knowledge, significantly underperform human physicians on both dimensions. These findings underscore the critical need for knowledge-enhanced architectures and explicitly interpretable modeling in clinical AI systems.

πŸ“ Abstract
Large language models (LLMs) have recently showcased remarkable capabilities across a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel at medical question answering but may struggle with a lack of interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), which aims to evaluate the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to supply essential knowledge for reasoning that may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT reveal a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Reasoning
Complex Healthcare Scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiReCT Dataset
Knowledge Graph Assisted Models
Medical Reasoning Evaluation
Authors

Bowen Wang
Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University

Jiuyang Chang
Department of Cardiology, The First Affiliated Hospital of Dalian Medical University

Yiming Qian
Agency for Science, Technology and Research (A*STAR)

Guoxin Chen

Junhao Chen
D3 Center, Osaka University

Zhouqiang Jiang
D3 Center, Osaka University

Jiahao Zhang
D3 Center, Osaka University

Yuta Nakashima
SANKEN, The University of Osaka
Computer Vision, Pattern Recognition, Natural Language Processing

Hajime Nagahara
Professor of Osaka University
Computational Photography, Computer Vision