LiveClin: A Live Clinical Benchmark without Leakage

📅 2026-02-17
🤖 AI Summary
Current evaluations of medical large language models are hindered by data contamination and outdated knowledge, which inflate scores on static benchmarks so that they poorly reflect real-world clinical competence. To address this, the work proposes LiveClin, a dynamically updated, multimodal clinical benchmark built from recent peer-reviewed case reports. Through a verified AI-human workflow involving 239 physicians, real patient cases are transformed into complex questions spanning the full clinical pathway. LiveClin is designed to resist contamination through continuous updating, combining rigorous de-identification, biannual data refreshes, and multimodal question design to keep the benchmark timely and clinically faithful. The benchmark currently comprises 1,407 case reports and 6,605 questions. Evaluation across 26 models yields a top Case Accuracy of only 35.7%; among human experts, Chief Physicians score highest, followed closely by Attending Physicians, and both groups surpass most models, highlighting the gap between current models and clinical experts.

📝 Abstract
The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real-world utility. Our data and code are publicly available at https://github.com/AQ-MedAI/LiveClin.
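
The abstract reports a Case Accuracy but does not define it. As a minimal sketch only, the code below assumes a case counts as solved when every question derived from that case is answered correctly, and contrasts this with plain per-question accuracy. The function names, the `(case_id, is_correct)` input format, and the toy data are hypothetical and are not taken from the LiveClin code base.

```python
from collections import defaultdict


def case_accuracy(results):
    """Fraction of cases whose questions were ALL answered correctly.

    ASSUMPTION: this all-questions-correct definition is illustrative;
    the source does not spell out how Case Accuracy is computed.
    `results` is an iterable of (case_id, is_correct) pairs, one per question.
    """
    per_case = defaultdict(list)
    for case_id, is_correct in results:
        per_case[case_id].append(bool(is_correct))
    if not per_case:
        return 0.0
    solved = sum(all(flags) for flags in per_case.values())
    return solved / len(per_case)


def question_accuracy(results):
    """Fraction of individual questions answered correctly."""
    results = list(results)
    if not results:
        return 0.0
    return sum(bool(ok) for _, ok in results) / len(results)


# Hypothetical toy data: three cases, seven questions in total.
toy = [
    ("case_a", True), ("case_a", True),
    ("case_b", True), ("case_b", False),
    ("case_c", True), ("case_c", True), ("case_c", True),
]

case_acc = case_accuracy(toy)          # 2 of 3 cases fully correct
question_acc = question_accuracy(toy)  # 6 of 7 questions correct
```

Under this assumed definition, a single wrong answer anywhere along a case's question chain fails the whole case, so case-level scores sit at or below question-level scores on the same predictions.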
Problem

Research questions and friction points this paper is trying to address.

data contamination
knowledge obsolescence
medical LLM evaluation
static benchmarks
clinical reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

live benchmark
data contamination
clinical currency
multimodal evaluation
medical LLMs