🤖 AI Summary
In psoriasis clinical trials, automatic severity scoring from remote smartphone images is vulnerable to spurious correlations induced by confounding factors—including variable lighting, heterogeneous backgrounds, device-specific artifacts, and inter-rater annotation inconsistencies. To address this, we propose an unsupervised training-sample diagnosis method grounded in gradient-based interpretability—enabling precise identification of spurious patterns and annotation-conflict samples without additional labels. Our approach integrates a ConvNeXT-based weakly supervised architecture with gradient-tracing analysis to support counterfactual attribution for misclassified samples and automated flagging of problematic images. Removing only 8.2% of low-quality or high-conflict samples improves test-set AUC-ROC by 5 percentage points (85% → 90%). On a dual-physician-annotated subset, the top 30% highest-risk samples identified by our method encompass over 90% of annotation disagreements—demonstrating substantial gains in model robustness and clinical reliability.
📝 Abstract
Psoriasis (PsO) severity scoring is important for clinical trials but is hindered by inter-rater variability and the burden of in-person clinical evaluation. Remote imaging using patient-captured mobile photos offers scalability but introduces challenges, such as variations in lighting, background, and device quality, that are often imperceptible to humans yet can impact model performance. These factors, along with inconsistencies in dermatologist annotations, reduce the reliability of automated severity scoring. We propose a framework that uses a gradient-based interpretability approach to automatically flag problematic training images that introduce spurious correlations and degrade model generalization. By tracing the gradients of misclassified validation images, we detect training samples where model errors align with inconsistently rated examples or are affected by subtle, non-clinical artifacts. We apply this method to a ConvNeXT-based weakly supervised model designed to classify PsO severity from phone images. Removing 8.2% of flagged images improves model AUC-ROC by 5 percentage points (85% to 90%) on a held-out test set. Commonly, multiple annotators and an adjudication process ensure annotation accuracy, which is expensive and time consuming. Our method detects training images with annotation inconsistencies, potentially removing the need for manual review. When applied to a subset of training data rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement by reviewing only the top 30% of samples. This improves automated scoring for remote assessments, ensuring robustness despite data collection variability.
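The abstract does not give the exact attribution formula, but the described idea — tracing gradients of misclassified validation images back to the training samples most aligned with them — can be sketched with a TracIn-style gradient dot product. The sketch below is a hypothetical simplification using a linear logistic model in NumPy; the function names, the single-checkpoint scoring, and the `top_frac` threshold are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def per_sample_grads(X, y, w):
    # Per-sample gradient of the logistic loss w.r.t. weights w:
    # grad_i = (sigmoid(x_i @ w) - y_i) * x_i, one row per sample.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X                      # shape (n, d)

def flag_training_samples(X_tr, y_tr, X_val_mis, y_val_mis, w, top_frac=0.3):
    """Rank training samples by gradient alignment with misclassified
    validation samples (hypothetical stand-in for the paper's method)."""
    g_tr = per_sample_grads(X_tr, y_tr, w)
    g_val = per_sample_grads(X_val_mis, y_val_mis, w)
    # Influence score: dot product with the summed misclassified-val gradient.
    scores = g_tr @ g_val.sum(axis=0)                # shape (n_train,)
    k = max(1, int(top_frac * len(X_tr)))
    # Highest scores = training samples most implicated in the errors.
    return np.argsort(scores)[::-1][:k]
```

In a real deep model the per-sample gradients would be taken over the final-layer parameters of the ConvNeXT backbone (or averaged over checkpoints, as in TracIn), and the top-ranked fraction would then be reviewed or removed, mirroring the 30%-review / 8.2%-removal protocol described above.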