🤖 AI Summary
Existing large language models (LLMs) perform poorly on Traditional Chinese Medicine (TCM) pattern differentiation and treatment (PDT) tasks, largely because TCM theory diverges sharply from modern biomedicine and high-quality, structured TCM corpora are scarce. To address this, the authors propose BianCang, an LLM designed specifically for TCM PDT. The method introduces a two-stage training paradigm: domain knowledge injection followed by instruction alignment on real clinical cases. They construct ChP-TCM, a dataset derived from the Pharmacopoeia of the People's Republic of China, and curate instruction data from multi-source, real-world hospital clinical records. Training further combines domain-adaptive knowledge injection, standardized modeling of TCM terminology, and a unified representation of heterogeneous corpora. Evaluations across 11 test sets and four core PDT tasks (syndrome identification, prescription recommendation, etiology analysis, and classical citation), involving comparisons with 29 models, demonstrate BianCang's effectiveness. The code, datasets, and model are publicly released.
📝 Abstract
The rise of large language models (LLMs) has driven significant progress in medical applications, including traditional Chinese medicine (TCM). However, current medical LLMs struggle with TCM diagnosis and syndrome differentiation due to substantial differences between TCM and modern medical theory, as well as the scarcity of specialized, high-quality corpora. This paper addresses these challenges by proposing BianCang, a TCM-specific LLM, trained with a two-stage process that first injects domain-specific knowledge and then aligns it through targeted stimulation. To enhance diagnostic and differentiation capabilities, we constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People's Republic of China. We compiled extensive TCM and medical corpora for continuous pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model's understanding of TCM. Evaluations across 11 test sets involving 29 models and four tasks demonstrate the effectiveness of BianCang, offering valuable insights for future research. Code, datasets, and models are available at https://github.com/QLU-NLP/BianCang.