TCM-Eval: An Expert-Level Dynamic and Extensible Benchmark for Traditional Chinese Medicine

📅 2025-11-10
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
The traditional Chinese medicine (TCM) domain lacks standardized evaluation benchmarks and high-quality training data, hindering rigorous assessment and advancement of TCM-oriented large language models (LLMs). Method: We introduce TCM-Eval—the first dynamic, extensible LLM evaluation benchmark for TCM—curated from the national TCM physician licensure examination question bank and validated by domain experts. We propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE), a novel data synthesis method that employs rejection sampling to automatically generate high-fidelity reasoning chains, enabling co-iterative optimization of both data and model. Concurrently, we construct a large-scale, domain-specific TCM corpus and develop ZhiMingTang (ZMT), an open-source LLM fine-tuned on this corpus. Contribution/Results: ZMT significantly surpasses the passing threshold of the national TCM physician examination. TCM-Eval establishes the first multi-level, scalable evaluation framework for TCM AI, accompanied by a public leaderboard, thereby advancing standardization, reproducibility, and sustainable development in TCM artificial intelligence.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains through rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
Problem

Research questions and friction points this paper is trying to address.

Developing standardized benchmarks for Traditional Chinese Medicine LLMs
Addressing scarcity of high-quality training data in TCM
Creating expert-validated evaluation systems for medical AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic benchmark curated from national examinations
Self-iterative chain-of-thought enhancement for reasoning chains
Large-scale training corpus with expert-validated question-answer pairs
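The SI-CoTE idea described above — sample candidate reasoning chains and keep only those whose final answer matches the gold label — can be illustrated with a minimal rejection-sampling sketch. This is an assumption-laden toy, not the paper's implementation: `generate_cot` is a hypothetical stand-in for a real LLM call, and the question strings and function names are invented for illustration.

```python
import random

def generate_cot(question: str, seed: int) -> tuple[str, str]:
    """Hypothetical stand-in for an LLM call.

    A real SI-CoTE pipeline would sample a chain-of-thought from the model;
    here we fake it with a random answer so the sketch is runnable.
    """
    random.seed(seed)
    answer = random.choice(["A", "B", "C", "D"])
    chain = f"Step-by-step reasoning for {question!r} leading to option {answer}."
    return chain, answer

def enrich_with_cot(qa_pairs: list[tuple[str, str]], n_samples: int = 8) -> list[dict]:
    """Rejection sampling: keep only chains whose final answer equals the gold label."""
    enriched = []
    for question, gold in qa_pairs:
        for seed in range(n_samples):
            chain, answer = generate_cot(question, seed)
            if answer == gold:  # rejection step: discard chains with wrong answers
                enriched.append({"question": question, "cot": chain, "answer": gold})
                break  # one validated chain per pair suffices in this sketch
    return enriched

# Toy exam-style items (invented for illustration).
pairs = [("Example herb question?", "B"), ("Example meridian question?", "D")]
data = enrich_with_cot(pairs)
```

In the paper's "co-evolution" loop, the validated question-chain-answer triples would then be used to fine-tune the model, whose improved sampling in turn yields more accepted chains in the next round.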
Zihao Cheng
School of Computer Science and Engineering, Beihang University
Yuheng Lu
Peking University (3D Computer Vision)
Huaiqian Ye
School of Computer Science and Engineering, Beihang University
Zeming Liu
School of Computer Science and Engineering, Beihang University
Minqi Wang
Beijing Zhimingtang Technology Co., Ltd.
Jingjing Liu
School of Computer Science and Engineering, Beihang University
Zihan Li
University of Washington (Foundation Model, AI for Healthcare, Multimodal Learning)
Wei Fan
Beijing Zhimingtang Technology Co., Ltd.
Yuanfang Guo
Beihang University (Multimedia Security, AI Security, Graph Neural Networks, Multimedia Processing)
Ruiji Fu
Beijing Zhiyan AI Technology Co., Ltd.
Shifeng She
Beijing Zhimingtang Technology Co., Ltd.; Guangzhou University of Chinese Medicine
Gang Wang
Beijing Zhimingtang Technology Co., Ltd.
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University (Biometrics, Pattern Recognition, Image Processing, Computer Vision)