🤖 AI Summary
The traditional Chinese medicine (TCM) domain lacks standardized evaluation benchmarks and high-quality training data, hindering rigorous assessment and advancement of TCM-oriented large language models (LLMs).
Method: We introduce TCM-Eval, the first dynamic and extensible evaluation benchmark for TCM LLMs, curated from the national TCM physician licensure examination question bank and validated by domain experts. We propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE), a novel data synthesis method that employs rejection sampling to automatically generate high-fidelity reasoning chains, enabling iterative co-optimization of data and model. In parallel, we construct a large-scale, domain-specific TCM corpus and develop ZhiMingTang (ZMT), an open-source LLM fine-tuned on this corpus.
Contribution/Results: ZMT significantly surpasses the passing threshold of the national TCM physician examination. TCM-Eval establishes the first multi-level, scalable evaluation framework for TCM AI, accompanied by a public leaderboard, thereby advancing standardization, reproducibility, and sustainable development in TCM artificial intelligence.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains via rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold set for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
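The rejection-sampling step behind SI-CoTE can be illustrated with a minimal sketch. All function and variable names below are hypothetical (the paper's actual prompting, sampling, and validation details are not specified here); a stub stands in for the LLM call. The core idea is simply: sample several candidate reasoning chains per question, keep only those whose final answer matches the gold answer, and feed the accepted chains into the next fine-tuning round.

```python
import random

def sample_chain(question, seed):
    """Stand-in for an LLM call (hypothetical): returns (reasoning, answer).
    A real pipeline would prompt the model to reason step by step and
    then parse its final answer choice."""
    rng = random.Random(seed)
    answer = rng.choice(["A", "B", "C", "D"])
    return f"Step-by-step reasoning for {question!r} (sample {seed})", answer

def rejection_sample_cot(question, gold_answer, num_samples=8):
    """Keep only reasoning chains whose final answer matches the gold
    answer; all other samples are rejected."""
    accepted = []
    for seed in range(num_samples):
        reasoning, answer = sample_chain(question, seed)
        if answer == gold_answer:  # rejection criterion
            accepted.append({"question": question,
                             "reasoning": reasoning,
                             "answer": answer})
    return accepted

def si_cote_round(qa_pairs, num_samples=8):
    """One enrichment round: the enriched pairs would then be used to
    fine-tune the model, and the improved model resampled next round."""
    enriched = []
    for question, gold in qa_pairs:
        enriched.extend(rejection_sample_cot(question, gold, num_samples))
    return enriched
```

Because only answer-verified chains survive, each round's training data is at least as clean as the last, which is what lets the data and the model improve together across iterations.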