MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of large language models (LLMs) in Traditional Chinese Medicine (TCM) lack systematicity and clinical authenticity, focusing predominantly on factual question answering while neglecting critical competencies such as diagnostic reasoning, prescription generation, and safety compliance. Method: We introduce MTCMB, the first multi-task TCM benchmark, comprising 12 subtasks across five categories: knowledge QA, linguistic understanding, diagnostic reasoning, prescription generation, and safety assessment. The benchmark integrates real-world medical cases, national licensure examination questions, and classical TCM texts, and was co-developed with certified TCM practitioners. MTCMB uniquely incorporates syndrome differentiation reasoning, herbal formula planning, and contraindication identification, evaluated via domain-specific methods including multi-granularity annotation, adversarial samples, and syndrome consistency scoring. Results: Experiments reveal that state-of-the-art LLMs perform reasonably on foundational knowledge but exhibit significant deficiencies in clinical reasoning, personalized prescription formulation, and safety judgment. We open-source the benchmark, evaluation toolkit, and baseline results to establish a reproducible standard for TCM AI evaluation.
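The multi-task evaluation described above reports scores per category rather than a single aggregate number. A minimal sketch of that kind of per-category scoring loop is shown below; the subtask names and the record format are illustrative assumptions, not MTCMB's actual schema or toolkit API (see the project repository for the real evaluation code).

```python
# Hedged sketch: per-category exact-match scoring for a multi-task benchmark.
# Record format (category, model_answer, reference_answer) is hypothetical.
from collections import defaultdict

results = [
    ("knowledge_qa", "A", "A"),
    ("knowledge_qa", "B", "C"),
    ("diagnostic_reasoning", "liver qi stagnation", "Liver Qi Stagnation"),
    ("safety_assessment", "contraindicated", "safe"),
]

def per_category_accuracy(records):
    """Return {category: fraction of case-insensitive exact matches}."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, pred, gold in records:
        total[category] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

print(per_category_accuracy(results))
# knowledge_qa: 0.5, diagnostic_reasoning: 1.0, safety_assessment: 0.0
```

Exact match is only a stand-in here; the paper's categories such as prescription generation would need generation-aware metrics (e.g. the syndrome consistency scoring mentioned in the summary).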

📝 Abstract
Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on TCM knowledge, reasoning, and safety
Addressing lack of standardization in TCM computational modeling
Assessing LLM performance in clinical reasoning and prescription planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task benchmark for TCM evaluation
Includes real-world cases and exams
Assesses knowledge, reasoning, and safety
👥 Authors

Shufeng Kong
Cornell University
Computational sustainability

Xingru Yang
School of Software Engineering, Sun Yat-sen University, Zhuhai, China

Yuanyuan Wei
School of Software Engineering, Sun Yat-sen University, Zhuhai, China

Zijie Wang
University of Arizona
Natural Language Processing

Hao Tang
Institute of TCM Diagnostics, Hunan University of Chinese Medicine, Changsha, China

Jiuqi Qin
Institute of TCM Diagnostics, Hunan University of Chinese Medicine, Changsha, China

Shuting Lan
Institute of TCM Diagnostics, Hunan University of Chinese Medicine, Changsha, China

Yingheng Wang
Cornell University
Computer Science

Junwen Bai
Google DeepMind
Machine Learning, Sequence Learning, Speech Recognition

Zhuangbin Chen
Assistant Professor, School of Software Engineering, Sun Yat-sen University
Software Engineering, Distributed Systems, Cloud Computing, LLM Systems

Zibin Zheng
IEEE Fellow, Highly Cited Researcher, Sun Yat-sen University, China
Blockchain, Smart Contract, Services Computing, Software Reliability

Caihua Liu
School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, China

Hao Liang
Institute of TCM Diagnostics, Hunan University of Chinese Medicine, Changsha, China