Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

📅 2025-10-20

🤖 AI Summary

Existing large language models (LLMs) for Traditional Chinese Medicine (TCM) are limited in alignment, training data quality, and evaluation consistency. To address these limitations, this paper introduces Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO). GRPO is a reinforcement learning method that scores each sampled response relative to the other responses in its group, which improves reasoning and factual consistency on TCM questions. Ladder-base is built on Qwen2.5-7B-Instruct and trained on the textual subset of the TCM-Ladder benchmark, with an 80/10/10 split into training, validation, and test sets. In standardized evaluation it outperforms general-purpose models (GPT-4, Gemini 2.5, Claude 3, Qwen3) and TCM-specific models (BenTsao, HuatuoGPT2, Zhongjing) across multiple reasoning metrics. The results suggest that GRPO is an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains.
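To make the intra-group comparison concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each response's reward is normalized by the mean and standard deviation of the rewards in its own group. The reward values, the use of the population standard deviation, and the function name `group_relative_advantages` are illustrative assumptions; the page does not describe the reward design used for Ladder-base.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive a positive advantage and are reinforced, while those below it are discouraged. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.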

📝 Abstract

Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
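The data split described in the abstract (80 percent for training, the remaining 20 percent divided evenly between validation and test) can be sketched as follows. The random shuffle, the fixed seed, and the function name `split_80_10_10` are assumptions made for illustration; the abstract does not say how items were assigned to each set.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and split them into train/validation/test (80/10/10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n = len(shuffled)
    n_train = n * 8 // 10          # 80 percent for training
    n_val = (n - n_train) // 2     # half of the remainder for validation
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the rest goes to the test set
    return train, val, test


# Hypothetical dataset of 1000 question IDs.
train, val, test = split_80_10_10(range(1000))
```

Integer arithmetic is used for the split sizes so that any leftover item from an odd remainder falls into the test set instead of being dropped.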

Problem

Research questions and friction points this paper is trying to address.

Advancing LLMs for Traditional Chinese Medicine's unique knowledge system

Addressing alignment and data quality limitations in TCM-specific language models

Improving reasoning and factual consistency through group-relative optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Group Relative Policy Optimization (GRPO), a reinforcement learning method based on intra-group comparisons

Built on Qwen2.5-7B-Instruct foundation model

Trained exclusively on the textual subset of the TCM-Ladder benchmark (80 percent training, 10 percent validation, 10 percent test)
