Building a Chinese Medical Dialogue System: Integrating Large-scale Corpora and Novel Models

📅 2024-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The COVID-19 pandemic exposed critical deficiencies in traditional healthcare systems’ capabilities for online triage and consultation. This study addresses two key bottlenecks: (1) scarcity and narrow domain coverage of Chinese medical dialogue data, limiting pre-trained model performance; and (2) lack of medical knowledge in existing methods, hindering accurate understanding of clinical terminology and patient expressions. To this end, we construct LCMDC—the first open-source, large-scale Chinese medical dialogue corpus comprising over one million real doctor–patient conversations—and propose a triage model integrating BERT-based supervised fine-tuning with prompt learning, alongside a consultation model leveraging GPT and domain-adaptive pre-training. Key contributions include: (i) the first publicly available large-scale Chinese medical dialogue corpus; (ii) the first application of prompt learning to Chinese medical triage; and (iii) enhanced medical knowledge representation via domain-specific pre-training. Experiments show a 12.3% improvement in triage accuracy and 91.7% clinical relevance in consultation responses.

Technology Category

Application Category

📝 Abstract
The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in Chinese medical dialogue systems
Improving understanding of medical terminology in consultations
Developing effective triage and consultation models using PLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates BERT and GPT models
Uses large-scale medical corpora
Enhances domain knowledge acquisition
🔎 Similar Papers
No similar papers found.
X
Xinyuan Wang
Xi’an Jiaotong University
H
Haozhou Li
Xi’an Jiaotong University
D
Dingfang Zheng
Xi’an Jiaotong University
Q
Qinke Peng
Xi’an Jiaotong University