Data Augmentation Techniques for Chinese Disease Name Normalization

📅 2025-01-02

🤖 AI Summary

To address the scarcity of annotated disease name data in Chinese medical texts, which impedes disease name normalization and limits model comprehension, this paper proposes a multi-strategy data augmentation framework designed specifically for Chinese disease name normalization. The framework integrates dictionary-constrained synonym substitution, BERT-based masked language modeling, terminology-aligned context-aware back-translation, and medical-domain adversarial perturbations to achieve robust generalization in few-shot settings. Evaluated across multiple baseline models, it yields an average F1-score improvement of 12.7%; using only 10% of the training data, it retains 89.3% of the original performance, substantially outperforming conventional augmentation methods. The core contribution is a rule- and semantics-driven data augmentation system tailored to Chinese disease name normalization, described here as the first of its kind.

📝 Abstract

Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.

Problem

Research questions and friction points this paper is trying to address.

Chinese Medical Field

Limited Training Data

Disease Name Standardization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Augmentation

Chinese Disease Name Recognition

Limited Data Handling
