🤖 AI Summary
This study investigates how the granularity of the Fitzpatrick Skin Type (FST) scale affects the performance and fairness of AI-driven dermatologic image classification. The authors train multiple classifiers for benign vs. malignant lesion classification on FST-specific data at differing granularities — a three-group scheme (FST 1/2, 3/4, 5/6) and a coarser merged scheme that pools FST 1–4 into a single group — and compare them against a general model trained on FST-balanced data. Results show that, under the three-group scheme, FST-specific models generally outperform the general model, while coarsening the granularity (merging FST 1/2 and 3/4 into 1/2/3/4) can degrade performance. The findings identify FST granularity as an important design choice for fair dermatologic AI and, given concerns about human bias in the FST scale's categories, support a transition to an alternative scale that better represents the diversity of human skin tones.
📝 Abstract
Artificial intelligence (AI) models that automatically classify skin lesions from dermatology images have shown promising performance but also susceptibility to bias by skin tone. The most common way of representing skin tone information is the Fitzpatrick Skin Type (FST) scale, which has been criticised for offering greater granularity in its categories for lighter-skinned subjects. This paper investigates the impact of FST-scale granularity on the performance and bias of AI classification models. By training multiple AI models to classify benign vs. malignant lesions using FST-specific data of differing granularity, we show that: (i) when training models using FST-specific data based on three groups (FST 1/2, 3/4 and 5/6), performance is generally better for models trained on FST-specific data than for a general model trained on FST-balanced data; (ii) reducing the granularity of FST information (merging FST 1/2 and 3/4 into a single 1/2/3/4 group) can have a detrimental effect on performance. These results highlight the importance of FST-group granularity when training lesion classification models. Given questions over possible human biases in the choice of categories in the FST scale, this paper provides evidence for moving away from the FST scale in fair AI research toward an alternative scale that better represents the diversity of human skin tones.
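The two granularity schemes the abstract compares can be made concrete with a short sketch. This is an illustrative reconstruction, not the paper's code: the function and variable names are hypothetical, and only the group boundaries (FST 1/2, 3/4, 5/6 vs. a merged FST 1–4 group) come from the abstract.

```python
from collections import defaultdict

def fst_to_ternary(fst: int) -> str:
    """Map an FST value (1-6) to the three-group scheme: FST 1/2, 3/4, 5/6."""
    if fst in (1, 2):
        return "FST 1/2"
    if fst in (3, 4):
        return "FST 3/4"
    if fst in (5, 6):
        return "FST 5/6"
    raise ValueError(f"invalid FST value: {fst}")

def fst_to_merged(fst: int) -> str:
    """Map an FST value to the coarser scheme that pools FST 1-4 together."""
    if fst in (1, 2, 3, 4):
        return "FST 1/2/3/4"
    if fst in (5, 6):
        return "FST 5/6"
    raise ValueError(f"invalid FST value: {fst}")

def stratify(records, grouping):
    """Split FST-annotated records into per-group subsets.

    records: iterable of (image_id, fst) pairs; grouping: one of the
    mapping functions above. Each subset could then be used to train
    an FST-specific lesion classifier.
    """
    groups = defaultdict(list)
    for image_id, fst in records:
        groups[grouping(fst)].append(image_id)
    return dict(groups)
```

For example, `stratify(dataset, fst_to_ternary)` yields three training subsets, whereas `stratify(dataset, fst_to_merged)` collapses the FST 1/2 and 3/4 subsets into one — the granularity reduction the abstract reports as detrimental.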