ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Chinese medical large language model (LLM) benchmarks are predominantly static and task-isolated, failing to capture the openness, longitudinal nature, and safety-critical demands of real-world clinical practice. To address this gap, this work introduces ClinConsensus, a clinically grounded Chinese medical benchmark co-designed and validated by medical experts, comprising 2,500 open-ended case scenarios spanning 36 specialties, 12 clinical task types, and multiple difficulty levels. The authors propose the Clinically Applicable Consistency Score (CACS@k) alongside a dual-judge framework that pairs a high-capability LLM-as-judge with a lightweight, locally deployable judge model, combined with expert annotations and rubric-based scoring protocols to enable scalable evaluation aligned with physician judgment. A comprehensive assessment of leading Chinese medical LLMs reveals substantial performance variation across tasks, care stages, and specialties, with treatment plan generation emerging as a persistent bottleneck.

📝 Abstract
Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2,500 open-ended cases spanning the full continuum of care, from prevention and intervention to long-term follow-up, covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
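The paper does not publish the CACS@k formula, but the abstract suggests a consistency-style metric computed over k judge evaluations against a clinical rubric. As a minimal sketch, assuming (hypothetically, not from the paper) that a response counts as "clinically applicable" when its rubric score clears a threshold, and that CACS@k is the fraction of cases judged applicable by at least k of the judge runs:

```python
# Hypothetical sketch of a CACS@k-style consistency metric.
# Assumptions (NOT from the paper): each case is scored by several judge
# runs on a [0, 1] rubric scale; a run deems a response "clinically
# applicable" when its score >= threshold; CACS@k is the fraction of
# cases that at least k runs deem applicable.
from typing import List


def cacs_at_k(rubric_scores: List[List[float]], k: int,
              threshold: float = 0.8) -> float:
    """rubric_scores[i][j] = judge run j's rubric score for case i."""
    if not rubric_scores:
        return 0.0
    applicable = 0
    for case_scores in rubric_scores:
        # Count judge runs that rate this case as clinically applicable.
        passing = sum(1 for s in case_scores if s >= threshold)
        if passing >= k:
            applicable += 1
    return applicable / len(rubric_scores)


# Example: two cases, three judge runs each; only the first case is
# rated applicable by at least 2 runs, so CACS@2 = 0.5.
print(cacs_at_k([[0.9, 0.85, 0.7], [0.6, 0.5, 0.9]], k=2))  # → 0.5
```

Under this reading, raising k tightens the requirement for cross-judge agreement, which is one plausible way a metric could penalize responses that only some judges find actionable.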
Problem

Research questions and friction points this paper is trying to address.

medical LLMs
clinical benchmark
evaluation
real-world clinical workflows
Chinese medical evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ClinConsensus
medical LLM evaluation
dual-judge framework
CACS@k
longitudinal clinical benchmark