Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM clinical evaluations predominantly rely on simplified QA benchmarks (e.g., MedQA), failing to capture the complexity and multidimensionality of real-world clinical decision-making. Method: We propose a dual-dimensional evaluation paradigm grounded in authentic clinical scenarios—orthogonally modeling tasks along *clinical context* (e.g., patient demographics, care settings) and *clinical reasoning* (e.g., diagnostic inference, therapeutic trade-offs)—to transcend the limitations of conventional single-answer QA assessment. Our framework integrates quantitative metrics across accuracy, reasoning efficiency, interpretability, and robustness, and systematically compares model performance under diverse clinical decision paradigms via combined training-time interventions and test-time enhancements. Contribution/Results: The study rigorously delineates the applicability boundaries of mainstream datasets and methods, identifies critical bottlenecks in clinical reasoning, and establishes a standardized, actionable benchmark for the trustworthy deployment of LLMs in clinical decision support.

📝 Abstract
Large language models (LLMs) show promise for clinical use and are often evaluated on datasets such as MedQA. However, many medical datasets, MedQA included, rely on simplified Question-Answering (QA) that underrepresents real-world clinical decision-making. To address this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the backgrounds and questions approach the real clinical environment, task difficulty increases. We map the settings of existing datasets and benchmarks onto these two dimensions. We then review methods for clinical decision-making, including training-time and test-time techniques, and summarize when each helps. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations of simplified medical QA datasets
Proposing unified paradigm for real-world clinical decision-making
Extending evaluation beyond accuracy to efficiency and explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Characterizes clinical decision-making along two dimensions
Reviews training-time and test-time techniques
Extends evaluation to efficiency and explainability
Yunpeng Xiao
Department of Computer Science, Emory University
Carl Yang
Waymo LLC, PhD at University of California, Davis
GPU Computing · Parallel Computing · Graph Processing
Mark Mai
Children’s Healthcare of Atlanta
Xiao Hu
Nell Hodgson Woodruff School of Nursing, Emory University
Kai Shu
Assistant Professor of Computer Science, Emory University
Data Mining · Trustworthy AI · Social Computing · Machine Learning · AI Safety