🤖 AI Summary
Current evaluations of medical large language models rely predominantly on standardized multiple-choice questions, which fail to capture the complexity, ambiguity, and long-tail demands of real-world clinical consultations. To address this limitation, this work introduces QuarkMedBench, a high-ecological-validity benchmark comprising over 20,000 single- and multi-turn dialogues across three real-world scenarios: clinical care, health and wellness, and professional consultation. The authors propose an automated scoring framework that leverages multi-model consensus and evidence retrieval to dynamically generate fine-grained, updatable rubrics for structured assessment of medical accuracy, key-point coverage, and risk mitigation. The framework achieves 91.8% agreement with blinded clinical expert reviews, and evaluations on the benchmark show that leading models underperform in these authentic settings relative to their scores on conventional exam-based metrics, validating both the necessity and the effectiveness of the proposed benchmark.
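The summary does not specify how the multi-model consensus step is implemented; the following is a minimal, hypothetical Python sketch of how rubric criteria drafted independently by several models could be merged by majority vote. The names `build_consensus_rubric` and `RubricItem`, the vote threshold, and the weighting scheme are illustrative assumptions, not the authors' actual pipeline.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str    # one fine-grained, checkable scoring point
    weight: float     # hierarchical weight used at evaluation time
    is_safety: bool   # hard safety constraint (risk interception)

def build_consensus_rubric(proposals: dict[str, list[str]],
                           safety_flags: set[str],
                           min_votes: int = 2) -> list[RubricItem]:
    """Merge rubric criteria proposed by several LLMs (keyed by model name):
    keep only criteria endorsed by >= min_votes models, and mark the ones
    listed in `safety_flags` as hard safety constraints."""
    votes = Counter()
    for criteria in proposals.values():
        for criterion in set(criteria):   # one vote per model per criterion
            votes[criterion] += 1
    rubric = []
    for criterion, n in votes.items():
        if n >= min_votes:
            is_safety = criterion in safety_flags
            weight = 2.0 if is_safety else 1.0   # illustrative weighting scheme
            rubric.append(RubricItem(criterion, weight, is_safety))
    return rubric

# Toy usage: criteria drafted by three hypothetical panel models for one query.
proposals = {
    "model_a": ["advise urgent care for chest pain with dyspnea", "explain likely causes"],
    "model_b": ["advise urgent care for chest pain with dyspnea", "ask about symptom duration"],
    "model_c": ["explain likely causes", "ask about symptom duration"],
}
rubric = build_consensus_rubric(
    proposals,
    safety_flags={"advise urgent care for chest pain with dyspnea"},
)
```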
📝 Abstract
While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate into high-quality responses to real-world medical queries. Current evaluations rely heavily on multiple-choice questions and fail to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a large-scale dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To evaluate open-ended answers objectively, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints provide a structured quantification of medical accuracy, key-point coverage, and risk interception, mitigating the high cost and subjectivity of human grading. Experimental results show that the generated rubrics achieve a 91.8% concordance rate with blinded clinical expert audits, indicating highly reliable automated medical assessment. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, and its framework supports dynamic knowledge updates to prevent benchmark obsolescence.
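As a companion to the sketch above, here is a hedged illustration of how hierarchical weighting and a hard safety constraint could turn a generated rubric into a single score. It reuses the hypothetical `RubricItem` type; the specific weights and the zero-on-safety-failure rule are assumptions for illustration, not the paper's actual scoring rules.

```python
def score_response(rubric: list[RubricItem], satisfied: set[str]) -> dict:
    """Score one model response against a generated rubric.
    `satisfied` is the set of criteria a judge marked as met by the response."""
    total_weight = sum(item.weight for item in rubric)
    earned_weight = sum(item.weight for item in rubric if item.criterion in satisfied)
    coverage = earned_weight / total_weight if total_weight else 0.0
    # Risk interception as a hard constraint: any missed safety-critical
    # criterion zeroes the final score (illustrative rule).
    safety_violated = any(item.is_safety and item.criterion not in satisfied
                          for item in rubric)
    return {
        "coverage": coverage,
        "safety_violated": safety_violated,
        "score": 0.0 if safety_violated else coverage,
    }

# Toy usage with the rubric built above: the response covers the safety-critical
# triage point and one of the two supporting points, so it scores 0.75.
result = score_response(rubric, satisfied={
    "advise urgent care for chest pain with dyspnea",
    "explain likely causes",
})
```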