🤖 AI Summary
The traditional Chinese medicine (TCM) domain lacks standardized evaluation benchmarks and high-quality training data, hindering rigorous assessment and advancement of TCM-oriented large language models (LLMs).
Method: We introduce TCM-Eval, the first dynamic and extensible evaluation benchmark for TCM LLMs, curated from the national TCM physician licensure examination question bank and validated by domain experts. We propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE), a novel data synthesis method that employs rejection sampling to automatically generate high-fidelity reasoning chains, enabling iterative co-optimization of data and model. In parallel, we construct a large-scale, domain-specific TCM corpus and develop ZhiMingTang (ZMT), an open-source LLM fine-tuned on this corpus.
Contribution/Results: ZMT significantly surpasses the passing threshold of the national TCM physician examination. TCM-Eval establishes the first multi-level, scalable evaluation framework for TCM AI, accompanied by a public leaderboard, thereby advancing standardization, reproducibility, and sustainable development in TCM artificial intelligence.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in modern medicine, yet their application in Traditional Chinese Medicine (TCM) remains severely limited by the absence of standardized benchmarks and the scarcity of high-quality training data. To address these challenges, we introduce TCM-Eval, the first dynamic and extensible benchmark for TCM, meticulously curated from national medical licensing examinations and validated by TCM experts. Furthermore, we construct a large-scale training corpus and propose Self-Iterative Chain-of-Thought Enhancement (SI-CoTE) to autonomously enrich question-answer pairs with validated reasoning chains via rejection sampling, establishing a virtuous cycle of data and model co-evolution. Using this enriched training data, we develop ZhiMingTang (ZMT), a state-of-the-art LLM specifically designed for TCM, which significantly exceeds the passing threshold set for human practitioners. To encourage future research and development, we release a public leaderboard, fostering community engagement and continuous improvement.
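The rejection-sampling step behind SI-CoTE can be illustrated with a minimal sketch. All function and variable names below are hypothetical (the paper's actual prompting, sampling, and validation details are not specified here); a stub stands in for the LLM call. The core idea is simply: sample several candidate reasoning chains per question, keep only those whose final answer matches the gold answer, and feed the accepted chains into the next fine-tuning round.

```python
import random

def sample_chain(question, seed):
    """Stand-in for an LLM call (hypothetical): returns (reasoning, answer).
    A real pipeline would prompt the model to reason step by step and
    then parse its final answer choice."""
    rng = random.Random(seed)
    answer = rng.choice(["A", "B", "C", "D"])
    return f"Step-by-step reasoning for {question!r} (sample {seed})", answer

def rejection_sample_cot(question, gold_answer, num_samples=8):
    """Keep only reasoning chains whose final answer matches the gold
    answer; all other samples are rejected."""
    accepted = []
    for seed in range(num_samples):
        reasoning, answer = sample_chain(question, seed)
        if answer == gold_answer:  # rejection criterion
            accepted.append({"question": question,
                             "reasoning": reasoning,
                             "answer": answer})
    return accepted

def si_cote_round(qa_pairs, num_samples=8):
    """One enrichment round: the enriched pairs would then be used to
    fine-tune the model, and the improved model resampled next round."""
    enriched = []
    for question, gold in qa_pairs:
        enriched.extend(rejection_sample_cot(question, gold, num_samples))
    return enriched
```

Because only answer-verified chains survive, each round's training data is at least as clean as the last, which is what lets the data and the model improve together across iterations.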