Data Augmentation Techniques for Chinese Disease Name Normalization

📅 2025-01-02
🤖 AI Summary
Annotated data for disease named entities in Chinese medical text is scarce, which hampers disease name normalization and limits how well models understand disease mentions. To address this, the paper proposes the first multi-strategy data augmentation framework designed specifically for Chinese disease name normalization. The framework combines dictionary-constrained synonym substitution, BERT-based masked language modeling, terminology-aligned context-aware back-translation, and medical-domain adversarial perturbations to achieve robust generalization in few-shot settings. Evaluated across multiple baseline models, it yields an average F1-score improvement of 12.7%; with only 10% of the training data, it retains 89.3% of the original performance and substantially outperforms conventional augmentation methods. The core contribution is a rule- and semantics-driven augmentation system tailored to this task.
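As an illustration of one of the strategies named above, dictionary-constrained synonym substitution can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the `SYNONYMS` dictionary, the function names, and the example labels are all hypothetical, and a real system would presumably draw on a curated medical lexicon.

```python
from typing import Dict, List, Tuple

# Hypothetical toy synonym dictionary (an assumption for illustration only);
# a real system would load a curated medical lexicon instead.
SYNONYMS: Dict[str, List[str]] = {
    "心肌梗死": ["心肌梗塞", "心梗"],
    "高血压": ["高血压病"],
}


def augment_by_synonyms(name: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return variants of `name` made by swapping one dictionary term at a time."""
    variants: List[str] = []
    for term, alternatives in synonyms.items():
        if term not in name:
            continue  # only substitute terms that actually occur in the name
        for alt in alternatives:
            candidate = name.replace(term, alt)
            if candidate != name and candidate not in variants:
                variants.append(candidate)
    return variants


def augment_dataset(
    pairs: List[Tuple[str, str]], synonyms: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Keep each (mention, label) pair and add its variants under the same label."""
    out: List[Tuple[str, str]] = []
    for mention, label in pairs:
        out.append((mention, label))
        out.extend((variant, label) for variant in augment_by_synonyms(mention, synonyms))
    return out
```

Because every variant inherits the label of the mention it was derived from, the augmented pairs can be added to a training set without further annotation. Restricting substitutions to dictionary entries is what keeps the new mentions medically plausible.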

📝 Abstract
Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.
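To make the task concrete, the sketch below treats normalization as choosing the closest entry from a fixed list of standardized names. It uses character-bigram Jaccard similarity purely as a stand-in baseline; the paper's baseline models are not described here, and the function names and the example name list are assumptions for illustration.

```python
from typing import List, Set


def char_bigrams(text: str) -> Set[str]:
    """Split a string into its set of overlapping character bigrams."""
    if len(text) < 2:
        return {text}
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(mention: str, standard_names: List[str]) -> str:
    """Map a free-form disease mention to the most similar standardized name."""
    grams = char_bigrams(mention)
    return max(standard_names, key=lambda name: jaccard(grams, char_bigrams(name)))


# Hypothetical list of standardized names, for illustration only.
STANDARD_NAMES = ["2型糖尿病", "原发性高血压", "急性心肌梗死"]
print(normalize("二型糖尿病", STANDARD_NAMES))
```

A surface-similarity matcher like this fails when a mention shares few characters with its standard name. That is the gap a trained classifier fills, and training one requires the labeled mentions whose shortage the abstract describes.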
Problem

Research questions and friction points this paper is trying to address.

Chinese Medical Field
Limited Training Data
Disease Name Standardization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Augmentation
Chinese Disease Name Recognition
Limited Data Handling
Wenqian Cui
Chinese University of Hong Kong
Deep Learning, Natural Language Processing, Large Language Models, AI Music, Music Generation
Xiangling Fu
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China
Shao-Chen Liu
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China
Mingjun Gu
School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China
Xien Liu
Tsinghua University
Deep Learning, Medical, NLP, Large Language Models
Ji Wu
Tsinghua University
Artificial Intelligence, smart healthcare, machine learning, pattern recognition, speech recognition
Irwin King
The Chinese University of Hong Kong
social computing, machine learning, AI, graph neural networks, NLP