Abstract
Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale, and they rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation as single-choice decision recognition. We conduct comprehensive zero-shot evaluations of 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, evaluation on the Hard subsets reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at https://github.com/TCMAI-BJTU/LingLan and http://tcmnlp.com.