QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

📅 2026-03-13
🤖 AI Summary
Current evaluations of medical large language models rely predominantly on standardized multiple-choice questions, which fail to capture the complexity, ambiguity, and long-tail demands of real-world medical queries. To address this limitation, this work introduces QuarkMedBench, a benchmark with high ecological validity comprising 20,821 single-turn queries and 3,853 multi-turn sessions across three real-world scenarios: Clinical Care, Wellness Health, and Professional Inquiry. The authors propose an automated scoring framework that combines multi-model consensus with evidence retrieval to dynamically generate fine-grained, updatable rubrics for structured assessment of medical accuracy, key-point coverage, and risk interception. The generated rubrics reach 91.8% concordance with blinded clinical expert audits, and baseline evaluations reveal significant performance disparities among state-of-the-art models in realistic settings, which exposes the limits of conventional exam-based metrics.
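
The multi-model consensus step mentioned above can be illustrated with a minimal sketch. The summary does not say how consensus is computed, so everything here is an assumption: the function name `consensus_rubrics`, the `min_votes` threshold, the example model names, and especially the exact-string matching after normalization (a real pipeline would need semantic matching and the evidence-retrieval check, neither of which is modeled).

```python
from collections import Counter


def consensus_rubrics(proposals, min_votes):
    """Keep rubric items proposed by at least `min_votes` distinct models.

    `proposals` maps a (hypothetical) model name to the rubric items it
    proposed for one query. Matching is exact after lowercasing, which is
    a simplification chosen for this sketch only.
    """
    votes = Counter()
    for model_items in proposals.values():
        # Use a set so each model votes at most once per item.
        for item in {text.strip().lower() for text in model_items}:
            votes[item] += 1
    return sorted(item for item, n in votes.items() if n >= min_votes)


# Hypothetical proposals from three models for a single query.
proposals = {
    "model_a": ["Advise urgent care for chest pain", "Ask about symptom duration"],
    "model_b": ["advise urgent care for chest pain", "Recommend a specific drug dose"],
    "model_c": ["Advise urgent care for chest pain", "ask about symptom duration"],
}

print(consensus_rubrics(proposals, min_votes=2))
# → ['advise urgent care for chest pain', 'ask about symptom duration']
```

The item proposed by only one model is dropped, which is the intuition behind using agreement between models to filter out idiosyncratic or unsupported rubric items.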

📝 Abstract
While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate into high-quality responses to real-world medical queries. Current evaluations rely heavily on multiple-choice questions and therefore miss the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. The dataset spans Clinical Care, Wellness Health, and Professional Inquiry, and comprises 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, mitigating the high cost and subjectivity of human grading. Experimental results show that the generated rubrics achieve a 91.8% concordance rate with clinical expert blind audits, which supports their medical reliability. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, while its framework inherently supports dynamic knowledge updates to prevent benchmark obsolescence.
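
To make "hierarchical weighting and safety constraints" concrete, the following is a minimal sketch of rubric-based scoring, not the authors' implementation. The abstract does not give the formula, so the names `RubricItem`, `score_response`, and `safety_cap`, the per-item weights, and the rule that a missed safety item caps the score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str
    weight: float            # assumed importance weight for this item
    is_safety: bool = False  # assumed flag: item guards against a risk


def score_response(items, satisfied, safety_cap=0.0):
    """Return weighted rubric coverage in [0, 1].

    `satisfied[i]` says whether the response meets `items[i]`. If any
    safety item is missed, the score is capped at `safety_cap` (an assumed
    way to model a safety constraint).
    """
    if len(items) != len(satisfied):
        raise ValueError("items and satisfied must have the same length")
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(items, satisfied) if ok)
    score = earned / total
    missed_safety = any(
        item.is_safety and not ok for item, ok in zip(items, satisfied)
    )
    return min(score, safety_cap) if missed_safety else score


# Hypothetical rubric for one query.
rubric = [
    RubricItem("Advises urgent care for red-flag symptoms", 3.0, is_safety=True),
    RubricItem("Explains the likely causes", 2.0),
    RubricItem("Suggests follow-up questions", 1.0),
]

print(score_response(rubric, [True, True, False]))   # 5/6 of the weight earned
print(score_response(rubric, [False, True, True]))   # safety item missed, capped at 0.0
```

Under these assumptions, a response that covers most key points but misses a safety-critical item scores no higher than the cap, which is one simple way to keep fluent but unsafe answers from ranking well.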
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
medical benchmark
real-world medical queries
evaluation gap
open-ended responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

real-world medical benchmark
automated scoring framework
evidence-based retrieval
multi-model consensus
dynamic rubric generation
Yao Wu
Quark Medical Team, Alibaba Group
Kangping Yin
Quark Medical Team, Alibaba Group
Liang Dong
Quark Medical Team, Alibaba Group
Zhenxin Ma
Quark Medical Team, Alibaba Group
Shuting Xu
Quark Medical Team, Alibaba Group
Xuehai Wang
Department of Learning, Informatics, Management & Ethics, Karolinska Institutet
Machine learning, Immunology, Biomedical AI, Oncology, Multimodal
Yuxuan Jiang
Quark Medical Team, Alibaba Group
Tingting Yu
Associate Professor, University of Connecticut
Software Engineering, Software Testing
Yunqing Hong
Quark Medical Team, Alibaba Group
Jiayi Liu
Quark Medical Team, Alibaba Group
Rianzhe Huang
Quark Medical Team, Alibaba Group
Shuxin Zhao
Quark Medical Team, Alibaba Group
Haiping Hu
Quark Medical Team, Alibaba Group
Wen Shang
King's College London
Wireless Communications, Wireless Networking, Non-Terrestrial Networks
Jian Xu
Senior Director, Ad Platform, Alibaba Group
Computational Advertising, Machine Learning, Data Mining, Data Privacy
Guanjun Jiang
Quark Medical Team, Alibaba Group