LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations in existing medical large language model (LLM) evaluation benchmarks: data contamination, outdated knowledge, and the absence of robust clinical reasoning assessment. To overcome these issues, the authors introduce the first dynamically updated, temporally isolated medical benchmark, constructed by continuously collecting real-world clinical cases from active online medical communities. The pipeline combines multi-agent clinical review, evidence-based validation, and fine-grained automated rubric scoring to ensure data integrity and timeliness. The resulting dataset comprises 2,756 authentic cases spanning 38 specialties, paired with 16,702 evaluation criteria. Evaluations of 38 leading LLMs show that even the best-performing model scores only 39.2%, and 84% of models degrade on cases published after their training cutoff, confirming pervasive contamination risk; error analysis identifies contextual application of knowledge, rather than factual recall, as the dominant bottleneck.
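The temporal-isolation idea behind the benchmark is simple enough to sketch: keep only cases that first appeared after a given model's training cutoff, so they cannot have leaked into its training corpus. Below is a minimal Python sketch of that filter; the `ClinicalCase` record, field names, and the example cutoff date are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ClinicalCase:
    case_id: str
    published: date  # date the case appeared in the source community
    question: str

def post_cutoff_cases(cases: list[ClinicalCase], cutoff: date) -> list[ClinicalCase]:
    """Keep only cases published strictly after the model's training
    cutoff, so they cannot have leaked into its training corpus."""
    return [c for c in cases if c.published > cutoff]

cases = [
    ClinicalCase("c1", date(2025, 3, 2), "..."),
    ClinicalCase("c2", date(2025, 11, 18), "..."),
]
# For a model with an assumed 2025-06-01 knowledge cutoff,
# only c2 survives the contamination filter.
print([c.case_id for c in post_cutoff_cases(cases, date(2025, 6, 1))])  # ['c2']
```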

📝 Abstract
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that harvests real-world clinical cases from online medical communities weekly, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
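To make the rubric idea concrete, here is a minimal sketch of how case-specific criteria might be aggregated into a score: the response earns the fraction of rubric criteria it satisfies. The `judge` callable stands in for whatever verifier checks a single criterion; the paper uses an automated framework, so the signature, names, and toy keyword judge here are hypothetical simplifications.

```python
from typing import Callable

def rubric_score(response: str,
                 criteria: list[str],
                 judge: Callable[[str, str], bool]) -> float:
    """Fraction of case-specific rubric criteria the response satisfies.

    `judge(criterion, response)` returns True if the response meets one
    criterion; any verifier (an LLM grader, a rule, a human) fits here.
    """
    if not criteria:
        return 0.0
    met = sum(judge(criterion, response) for criterion in criteria)
    return met / len(criteria)

# Toy usage with a keyword-matching stand-in judge:
criteria = [
    "recommends checking serum potassium",
    "flags the NSAID interaction",
]
naive_judge = lambda crit, resp: crit.split()[-1] in resp.lower()
print(rubric_score("Check potassium; stop the NSAID (interaction).",
                   criteria, naive_judge))  # 1.0
```

Averaging over per-case criteria like this is what makes the reported 39.2% interpretable as a fine-grained score rather than a single pass/fail judgment.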
Problem

Research questions and friction points this paper is trying to address.

data contamination
temporal misalignment
clinical reasoning
medical benchmark
evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

contamination-free benchmark
automated rubric evaluation
multi-agent clinical curation
temporal alignment
real-world clinical cases