Abstract
Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale, and they rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation as single-choice decision recognition. We conduct comprehensive zero-shot evaluations of 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, evaluation on the Hard subsets reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at https://github.com/TCMAI-BJTU/LingLan and http://tcmnlp.com.