Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding

📅 2025-11-17

🤖 AI Summary

Problem: Automatic ICD coding in clinical NLP suffers from low macro-F1 because diagnosis codes follow a long-tailed distribution, especially in MIMIC-III, where thousands of codes are rare or zero-shot. Method: We propose a data-centric synthetic augmentation framework that builds multi-label code sets anchored on rare codes from real-world co-occurrence patterns, ICD descriptions, synonyms, and the code taxonomy; structured prompts built from these sets then guide large language models to generate 90,000 synthetic discharge notes covering 7,902 ICD codes. Contribution/Results: Fine-tuning two state-of-the-art models, PLM-ICD and GKI-ICD, on the original data extended with the synthetic notes yields a modest macro-F1 gain while preserving strong micro-F1, outperforming prior SOTA. Although the gain may seem marginal relative to the computational cost, the results indicate that carefully crafted synthetic data can make prediction more equitable across long-tail codes.

📝 Abstract

Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using structured prompts built from these sources, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.

Problem

Research questions and friction points this paper is trying to address.

Addressing extreme long-tail distribution of diagnostic codes in medical NLP

Mitigating underrepresentation of rare ICD codes in clinical datasets

Improving automatic ICD coding performance for rare diagnostic codes
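
The long-tail problem listed above shows up as a gap between micro-F1 and macro-F1: micro-F1 pools decisions across all codes, so frequent codes dominate it, while macro-F1 averages per-code F1, so every missed rare code pulls it down equally. The following is a minimal sketch of that effect on toy data. The code names (`HEAD`, `TAIL_A`, `TAIL_B`), the helper names, and the ten-note example are illustrative assumptions and are not taken from the paper or from MIMIC-III.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """gold and pred are lists of code sets, one set per note."""
    counts = {c: [0, 0, 0] for c in labels}  # [tp, fp, fn] per code
    for g, p in zip(gold, pred):
        for c in labels:
            if c in g and c in p:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Macro: average the per-code F1 scores, so each code weighs the same.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so frequent codes dominate.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1(tp, fp, fn), macro


# Toy data (hypothetical): one frequent code and two rare codes.
labels = ["HEAD", "TAIL_A", "TAIL_B"]
gold = [{"HEAD"}] * 8 + [{"HEAD", "TAIL_A"}, {"HEAD", "TAIL_B"}]
# A model that only ever predicts the frequent code.
pred = [{"HEAD"}] * 10

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the model never predicts a rare code, yet micro-F1 stays above 0.9 while macro-F1 falls to one third, which is why macro-F1 is the metric of interest for tail codes.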

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic clinical notes using structured prompts

Leverages co-occurrence patterns and ICD taxonomy for realism

Fine-tunes PLM-ICD and GKI-ICD on the original dataset extended with the synthetic notes
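
The first two points above can be sketched roughly as follows: count how often codes appear together in real notes, build a multi-label code set around a rare anchor code from its most frequent co-occurring codes, and turn that set into a structured prompt. This is only an illustration under stated assumptions and not the authors' implementation. The function names, the top-k selection rule, the prompt wording, and the placeholder codes and descriptions are all hypothetical. The sketch also omits the synonyms, taxonomy, and similar-note retrieval that the abstract mentions, which would be needed for zero-shot codes that have no co-occurrence statistics.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(code_sets):
    """Count how often each pair of codes is assigned to the same note."""
    co = {}
    for codes in code_sets:
        for a, b in combinations(sorted(set(codes)), 2):
            co.setdefault(a, Counter())[b] += 1
            co.setdefault(b, Counter())[a] += 1
    return co


def code_set_for(anchor, co, size=3):
    """Anchor code plus its most frequent co-occurring codes (hypothetical rule)."""
    neighbours = sorted(
        co.get(anchor, Counter()).items(),
        key=lambda kv: (-kv[1], kv[0]),  # by count, ties broken alphabetically
    )
    return [anchor] + [code for code, _ in neighbours[: size - 1]]


def build_prompt(codes, descriptions):
    """Assemble a structured prompt; the wording is an assumption."""
    lines = [
        f"- {c}: {descriptions.get(c, 'description unavailable')}" for c in codes
    ]
    return (
        "Write a synthetic discharge summary consistent with these diagnoses:\n"
        + "\n".join(lines)
    )


# Placeholder code sets standing in for real annotated notes.
real_code_sets = [
    ["RARE_1", "COMMON_A", "COMMON_B"],
    ["RARE_1", "COMMON_A"],
    ["COMMON_A", "COMMON_C"],
    ["COMMON_B", "COMMON_C"],
]
descriptions = {
    "RARE_1": "placeholder description of a rare diagnosis",
    "COMMON_A": "placeholder description of a common diagnosis",
}

co = build_cooccurrence(real_code_sets)
codes = code_set_for("RARE_1", co, size=3)
prompt = build_prompt(codes, descriptions)
print(prompt)
```

Selecting the top co-occurring codes keeps the sketch deterministic; sampling neighbours in proportion to their counts would be one way to get more varied code sets for the same anchor.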
