🤖 AI Summary
Automatic ICD coding in clinical NLP suffers from severely low macro-F1 due to the long-tailed distribution of diagnosis codes, especially in MIMIC-III, where thousands of codes are rare or zero-shot.
Method: We propose a data-centric synthetic augmentation framework that leverages real-world co-occurrence patterns, ICD ontology descriptions, and terminology hierarchies to construct multi-label code sets. Guided by structured prompts, large language models then generate 90,000 high-quality synthetic discharge notes covering 7,902 ICD codes.
Contribution/Results: The synthetic data mitigates label imbalance and improves prediction equity across tail classes. Fine-tuning state-of-the-art models (PLM-ICD and GKI-ICD) on the extended data yields modest macro-F1 gains while preserving strong micro-F1, outperforming prior SOTA. This work validates ontology-informed, LLM-generated synthetic data for rare-code ICD prediction, although the gain may seem marginal relative to the computational cost.
📝 Abstract
Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extremely long-tailed distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using structured prompts built from this information, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming the prior state of the art. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
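
To make the macro-F1 versus micro-F1 distinction concrete, the following is a minimal sketch of how the two averages are commonly computed for multi-label predictions. It is an illustration only, not the evaluation code used in this work: the label names `COMMON` and `RARE` and the toy gold/predicted sets are hypothetical, and the choice to score a label with no true positives, false positives, or false negatives as 0.0 is an assumed convention.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the label never appears (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """Return (micro-F1, macro-F1) for lists of gold and predicted label sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in labels:
            if code in g and code in p:
                tp[code] += 1
            elif code in p:
                fp[code] += 1
            elif code in g:
                fn[code] += 1
    # Micro: pool counts over all labels, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-label F1, so every code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one frequent code that is always predicted correctly,
# and one rare code that appears once and is never predicted.
labels = ["COMMON", "RARE"]
gold = [{"COMMON"}] * 9 + [{"COMMON", "RARE"}]
pred = [{"COMMON"}] * 10

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy setting the single missed rare code barely moves micro-F1 but halves macro-F1, which is why a long tail of rare codes depresses macro-F1 even when micro-F1 remains strong.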