Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large language models (LLMs) for Traditional Chinese Medicine (TCM) suffer from factual inconsistency, suboptimal training-data quality, and insufficient evaluation benchmarks. To address these limitations, this paper introduces Ladder-base, the first TCM LLM trained with Group Relative Policy Optimization (GRPO). GRPO performs reinforcement learning via intra-group preference comparisons, thereby enhancing both factual accuracy in complex TCM reasoning and alignment with expert-level knowledge. Built upon the Qwen2.5-7B-Instruct architecture and trained and evaluated on the textual subset of the TCM-Ladder benchmark, the method achieves state-of-the-art performance across multiple TCM-specific reasoning benchmarks, outperforming advanced general-purpose models (GPT-4, Gemini 2.5, Claude 3) as well as leading domain-specific models (BenTsao, HuatuoGPT2, Zhongjing). These results empirically validate the efficacy and generalizability of the group-relative optimization paradigm for training specialized LLMs in vertical professional domains.

📝 Abstract
Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs, such as GPT-4, Gemini 2.5, Claude 3, and Qwen3, and domain-specific TCM models, including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
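The core of GRPO's "intra-group comparison" can be illustrated with a minimal sketch: for each prompt, a group of candidate responses is sampled and scored, and each response's advantage is computed relative to its own group's statistics, so no separate value network is needed. This is an illustrative reconstruction of the general GRPO advantage formula, not the paper's actual training code; the reward values below are invented.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each response's reward
    against the mean and std of its own sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for 4 sampled answers to one TCM question
# (e.g., from a reward model scoring factual accuracy).
adv = grpo_advantages([0.9, 0.2, 0.5, 0.4])
# The best-scored answer receives the largest positive advantage;
# below-average answers receive negative advantages.
```

These advantages then weight a PPO-style clipped policy-gradient objective, reinforcing responses that beat their group's average.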
Problem

Research questions and friction points this paper is trying to address.

Advancing LLMs for Traditional Chinese Medicine's unique knowledge system
Addressing alignment and data quality limitations in TCM-specific language models
Improving reasoning and factual consistency through group-relative optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Group Relative Policy Optimization reinforcement learning
Built on Qwen2.5-7B-Instruct foundation model
Trained exclusively on TCM-Ladder benchmark data
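The abstract's data protocol (80 percent of the TCM-Ladder textual subset for training, with the remaining 20 percent split evenly between validation and test) can be sketched as below; the function name and seed are illustrative, since the paper's actual partitioning code is not shown.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and partition items into 80/10/10 train/val/test,
    matching the split described for TCM-Ladder's textual subset."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_train = int(len(items) * 0.8)
    n_val = (len(items) - n_train) // 2
    train = [items[i] for i in idx[:n_train]]
    val = [items[i] for i in idx[n_train:n_train + n_val]]
    test = [items[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = split_80_10_10(list(range(100)))
# 100 items -> 80 train, 10 validation, 10 test, with no overlap.
```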
Authors
Jiacheng Xie (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA)
Shuai Zeng (University of Missouri - Columbia)
Yang Yu (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA)
Xiaoting Tang (Community Health Service Center Shanghai Pudong New Area, Shanghai, China)
Guanghui An (School of Acupuncture-Moxibustion and Tuina, Shanghai University of Traditional Chinese Medicine, Shanghai, China)
Dong Xu (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA)