🤖 AI Summary
The impact of preprocessing complexity and generative model selection on the quality of synthetic dermoscopic images and downstream melanoma diagnosis remains poorly understood. Method: We introduce SkinGenBench, the first benchmark dedicated to dermoscopic image synthesis, and systematically compare StyleGAN2-ADA and DDPM models under basic geometric augmentation and advanced artifact-removal preprocessing pipelines. Contribution/Results: Generative architecture choice dominates both image fidelity (FID ≈ 65.5; KID ≈ 0.05) and diagnostic utility, outweighing preprocessing effects; aggressive artifact removal can suppress clinically relevant textural cues. Integrating synthetic data boosts ViT-B/16 performance to F1 ≈ 0.88 and ROC-AUC ≈ 0.98, improving melanoma F1-score by 8–15 percentage points over non-augmented baselines. This work establishes the central role of generative model design in medical image synthesis and provides a methodological foundation for trustworthy AI-assisted skin cancer diagnosis.
📝 Abstract
This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K spanning five lesion classes, we evaluate two representative generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under basic geometric augmentation and advanced artifact-removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature-space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID (~65.5) and KID (~0.05), while diffusion models generated higher-variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection, with 8–15 percentage point absolute gains in melanoma F1-score; ViT-B/16 achieved F1 ≈ 0.88 and ROC-AUC ≈ 0.98, an improvement of approximately 14% over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench.
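As a concrete illustration of the distributional metrics used above, the Fréchet Inception Distance (FID) compares the mean and covariance of feature embeddings of real versus synthetic images. The sketch below computes FID from two precomputed feature matrices (in practice these would be Inception-v3 activations); the function name and the use of raw NumPy arrays are illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_synth: np.ndarray) -> float:
    """Fréchet Inception Distance between two (n_samples, n_features)
    feature matrices:  ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^{1/2}).
    Illustrative sketch; real pipelines extract features with Inception-v3."""
    mu_r, mu_s = feats_real.mean(axis=0), feats_synth.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_s = np.cov(feats_synth, rowvar=False)
    # Matrix square root of the covariance product; discard tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))

# Example usage on random stand-in features: identical sets give FID near 0,
# a shifted distribution gives a clearly larger value.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
shifted = rng.normal(loc=1.0, size=(500, 16))
print(fid(real, real), fid(real, shifted))
```

Lower FID indicates closer alignment between synthetic and real feature distributions, which is why StyleGAN2-ADA's FID of roughly 65.5 is reported as the best result in the benchmark.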