🤖 AI Summary
This study investigates how the granularity of the Fitzpatrick Skin Type (FST) scale affects the performance and fairness of AI-driven dermatologic image classification. The authors train multiple classifiers for benign vs. malignant lesion classification on FST-specific data at differing granularities — a three-group scheme (FST 1/2, 3/4, 5/6) and a coarser merged scheme that pools FST 1–4 into a single group — and compare them against a general model trained on FST-balanced data. Results show that, under the three-group scheme, FST-specific models generally outperform the general model, while coarsening the granularity (merging FST 1/2 and 3/4 into 1/2/3/4) can degrade performance. The findings identify FST granularity as an important design choice for fair dermatologic AI and, given concerns about human bias in the FST scale's categories, support a transition to an alternative scale that better represents the diversity of human skin tones.
📝 Abstract
Artificial intelligence (AI) models that automatically classify skin lesions from dermatology images have shown promising performance but also susceptibility to bias by skin tone. The most common way of representing skin tone information is the Fitzpatrick Skin Type (FST) scale, which has been criticised for offering greater granularity in its categories for lighter-skinned subjects. This paper investigates the impact of FST-scale granularity on the performance and bias of AI classification models. By training multiple AI models to classify benign vs. malignant lesions using FST-specific data of differing granularity, we show that: (i) when training models using FST-specific data based on three groups (FST 1/2, 3/4 and 5/6), performance is generally better for models trained on FST-specific data than for a general model trained on FST-balanced data; (ii) reducing the granularity of FST information (merging FST 1/2 and 3/4 into a single 1/2/3/4 group) can have a detrimental effect on performance. These results highlight the importance of FST-group granularity when training lesion classification models. Given questions over possible human biases in the choice of categories in the FST scale, this paper provides evidence for moving away from the FST scale in fair AI research toward an alternative scale that better represents the diversity of human skin tones.
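The two granularity schemes the abstract compares can be made concrete with a short sketch. This is an illustrative reconstruction, not the paper's code: the function and variable names are hypothetical, and only the group boundaries (FST 1/2, 3/4, 5/6 vs. a merged FST 1–4 group) come from the abstract.

```python
from collections import defaultdict

def fst_to_ternary(fst: int) -> str:
    """Map an FST value (1-6) to the three-group scheme: FST 1/2, 3/4, 5/6."""
    if fst in (1, 2):
        return "FST 1/2"
    if fst in (3, 4):
        return "FST 3/4"
    if fst in (5, 6):
        return "FST 5/6"
    raise ValueError(f"invalid FST value: {fst}")

def fst_to_merged(fst: int) -> str:
    """Map an FST value to the coarser scheme that pools FST 1-4 together."""
    if fst in (1, 2, 3, 4):
        return "FST 1/2/3/4"
    if fst in (5, 6):
        return "FST 5/6"
    raise ValueError(f"invalid FST value: {fst}")

def stratify(records, grouping):
    """Split FST-annotated records into per-group subsets.

    records: iterable of (image_id, fst) pairs; grouping: one of the
    mapping functions above. Each subset could then be used to train
    an FST-specific lesion classifier.
    """
    groups = defaultdict(list)
    for image_id, fst in records:
        groups[grouping(fst)].append(image_id)
    return dict(groups)
```

For example, `stratify(dataset, fst_to_ternary)` yields three training subsets, whereas `stratify(dataset, fst_to_merged)` collapses the FST 1/2 and 3/4 subsets into one — the granularity reduction the abstract reports as detrimental.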