MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) evaluation benchmarks struggle to assess dynamic capabilities in real-world clinical settings. To address this gap, this work proposes MLB, the first scenario-driven benchmark tailored for clinical practice, encompassing five dimensions: medical knowledge, safety and ethics, clinical note comprehension, intelligent services, and smart healthcare. MLB integrates 22 datasets—17 newly constructed—with contributions from 300 licensed physicians. We introduce a comprehensive evaluation framework that combines foundational medical knowledge with multidimensional clinical reasoning and train a specialized judge model via supervised fine-tuning on 19k expert-annotated samples. This yields a highly consistent and scalable evaluator, achieving 92.1% accuracy and a Cohen’s Kappa of 81.3% against human judgments. Evaluations reveal that state-of-the-art models excel in structured tasks (87.8%) but significantly underperform in realistic patient interaction scenarios (61.3%), highlighting critical challenges for clinical deployment.

📝 Abstract
The proliferation of Large Language Models (LLMs) presents transformative potential for healthcare, yet practical deployment is hindered by the absence of frameworks that assess real-world clinical utility. Existing benchmarks test static knowledge, failing to capture the dynamic, application-oriented capabilities required in clinical practice. To bridge this gap, we introduce the Medical LLM Benchmark (MLB), a comprehensive benchmark evaluating LLMs on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). The benchmark integrates 22 datasets (17 newly curated) from diverse Chinese clinical sources, covering 64 clinical specialties. Its design features a rigorous curation pipeline involving 300 licensed physicians. In addition, we provide a scalable evaluation methodology centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations. Our comprehensive evaluation of 10 leading models reveals a critical translational gap: while the top-ranked model, Kimi-K2-Instruct (77.3% overall accuracy), excels in structured tasks such as information extraction (87.8% accuracy in MedRU), its performance plummets in patient-facing scenarios (61.3% in SmartServ). Moreover, the exceptional safety score (90.6% in MedSE) of the much smaller Baichuan-M2-32B highlights that targeted training is equally critical. Our specialized judge model, trained via SFT on a 19k expert-annotated medical dataset, achieves 92.1% accuracy, an F1-score of 94.37%, and a Cohen's Kappa of 81.3% for human-AI consistency, validating a reproducible and expert-aligned evaluation protocol. MLB thus provides a rigorous framework to guide the development of clinically viable LLMs.
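The abstract validates the judge model by its agreement with human experts, reported as accuracy and Cohen's Kappa. As a quick illustration of how such agreement figures are computed, here is a minimal sketch; the verdict labels below are hypothetical, not the paper's data:

```python
# Agreement metrics between human and model judgments:
# accuracy (raw agreement) and Cohen's kappa (chance-corrected agreement).
from collections import Counter

def accuracy(human, model):
    """Fraction of items where the two raters agree."""
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohens_kappa(human, model):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = sum(h == m for h, m in zip(human, model)) / n  # observed agreement
    ch, cm = Counter(human), Counter(model)
    labels = set(human) | set(model)
    p_e = sum(ch[l] * cm[l] for l in labels) / (n * n)   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical human vs. judge-model verdicts on six answers
human = ["correct", "correct", "wrong", "correct", "wrong", "correct"]
model = ["correct", "correct", "wrong", "wrong",  "wrong", "correct"]
```

Kappa discounts the agreement two raters would reach by chance alone, which is why a judge model can show high raw accuracy yet a noticeably lower kappa on imbalanced verdict distributions.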
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Clinical Benchmark
Scenario-Based Evaluation
Medical AI
Real-World Utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

scenario-driven benchmark
clinical LLM evaluation
expert-aligned judge model
medical reasoning
Supervised Fine-Tuning (SFT)
Qing He
Ant Group, Hangzhou, China
Dongsheng Bi
Ant Group, Hangzhou, China
Jianrong Lu
ZJU
Generative Model, Federated Learning
Minghui Yang
Ant Group
NLP, Dialogue, Graph, 3DV
Zixiao Chen
Ant Group, Hangzhou, China
Jiacheng Lu
Ant Group, Hangzhou, China
Jing Chen
Ant Group, Hangzhou, China
Nannan Du
Ant Group, Hangzhou, China
Xiao Cu
Ant Group, Hangzhou, China
Sijing Wu
Health Information Center of Zhejiang Province, Hangzhou, China
Peng Xiang
Tsinghua University
Deep Learning, 3D Computer Vision, Point Cloud
Yinyin Hu
Health Information Center of Zhejiang Province, Hangzhou, China
Yi Guo
Health Information Center of Zhejiang Province, Hangzhou, China
Chunpu Li
Health Information Center of Zhejiang Province, Hangzhou, China
Shaoyang Li
Ant Group, Hangzhou, China
Zhuo Dong
Ant Group, Hangzhou, China
Ming Jiang
Ant Group, Hangzhou, China
Shuai Guo
Ant Group, Hangzhou, China
Liyun Feng
Ant Group, Hangzhou, China
Jin Peng
Ant Group, Hangzhou, China
Jian Wang
Senior Staff Algorithm Engineer, Ant Group
Computer Vision, Multimodal, LLM
Jinjie Gu
Ant Group
Machine Learning, Recommendation
Junwei Liu
School of Software and Microelectronics, Peking University, Beijing, China