Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the low diagnostic accuracy on rare tail classes, such as high-risk skin lesions, in medical image classification under long-tailed data distributions. The authors propose a diffusion-based synthetic data augmentation pipeline that combines an inpainting diffusion model with an out-of-distribution (OOD) post-selection mechanism. The pipeline is designed to generate diverse, realistic, and clinically meaningful synthetic images. On the ISIC2019 dataset, the method improves overall long-tailed classification performance, with a gain of more than 28% on the class with the fewest samples. The authors contrast this with specialized architectures, rebalanced loss functions, and handcrafted data augmentation, which they report offer only marginal improvements.
📝 Abstract
Long-tailed class distributions are pervasive in multi-class medical datasets and pose significant challenges for deep learning models, which typically underperform on tail classes with limited samples. This limitation is particularly problematic in medical applications, where rare classes often correspond to severe or high-risk diseases and therefore require high diagnostic accuracy. Existing solutions, including specialized architectures, rebalanced loss functions, and handcrafted data augmentation, offer only marginal improvements and struggle to scale due to their limited and largely deterministic variability. To address these challenges, we introduce a diffusion-model-driven synthetic data augmentation pipeline tailored for medical long-tailed classification. Our approach features a novel inpainting diffusion model combined with an Out-of-Distribution (OOD) post-selection mechanism to ensure diverse, realistic, and clinically meaningful synthetic samples. Evaluated on the ISIC2019 skin lesion classification dataset, one of the largest and most imbalanced medical imaging benchmarks, our method yields substantial improvements in overall performance, with particularly pronounced gains on tail classes, including an improvement of more than 28% on the class with the fewest samples. These results demonstrate the effectiveness of diffusion-based augmentation in mitigating long-tail imbalance and enhancing medical classification robustness.
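The abstract does not say how the OOD post-selection scores samples, so the following is only a minimal sketch of the general idea under stated assumptions: synthetic images for a tail class are kept only if their feature vectors lie no farther from the real-class centroid than most real samples do. The function names, the centroid-distance score, the 0.95 quantile, and the toy Gaussian "features" are all hypothetical and are not taken from the paper; in practice the features would come from some image encoder.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_in_distribution(real_feats, synth_feats, quantile=0.95):
    """Return one boolean per synthetic sample: True means keep it.

    Assumption (not from the paper): a sample counts as in-distribution
    if its Euclidean distance to the centroid of the real features is no
    larger than the distance at the given quantile of the real samples.
    """
    center = centroid(real_feats)
    real_dist = sorted(math.dist(f, center) for f in real_feats)
    # Index of the real-sample distance used as the acceptance threshold.
    idx = min(len(real_dist) - 1, int(quantile * len(real_dist)))
    threshold = real_dist[idx]
    return [math.dist(f, center) <= threshold for f in synth_feats]


def gaussian_features(rng, mean, count, dim=8):
    """Toy stand-in for encoder features: i.i.d. Gaussian vectors."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
real = gaussian_features(rng, 0.0, 200)        # real tail-class features
synth_close = gaussian_features(rng, 0.0, 50)  # plausible synthetic samples
synth_far = gaussian_features(rng, 6.0, 50)    # implausible synthetic samples

keep = select_in_distribution(real, synth_close + synth_far)
print("kept (close):", sum(keep[:50]), "kept (far):", sum(keep[50:]))
```

A stricter quantile discards more synthetic samples and favors realism. A looser one keeps more of them and favors diversity, which is the trade-off the post-selection step is meant to manage.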
Problem

Research questions and friction points this paper is trying to address.

long-tailed classification
medical image analysis
class imbalance
skin lesion classification
rare disease diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
synthetic data generation
long-tail classification
medical image analysis
out-of-distribution detection