🤖 AI Summary
This study addresses the challenge of dual imbalance, spanning both sex and disease categories, together with sparse pathological signals in multi-class pulmonary disease diagnosis from chest CT scans. To tackle these issues without slice-level annotations, the authors propose an attention-based multiple instance learning framework built on a ConvNeXt backbone that automatically identifies diagnostically critical slices. For the first time in this context, a gradient reversal layer is introduced to mitigate sex-related bias. The approach further integrates focal loss, label smoothing, five-fold cross-validation stratified over joint (class, sex) strata, and oversampling of minority subgroups to improve fairness and robustness. On the validation set, the method achieves an average competition score of 0.685, with the best single fold reaching 0.759. The implementation code is publicly available.
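The attention pooling at the heart of the MIL framework can be sketched as follows. This is a minimal numpy illustration of attention-based MIL pooling over per-slice CNN features (in the style commonly used for attention MIL), not the authors' actual model; the function name, parameter shapes, and the tanh scoring head are assumptions for illustration.

```python
import numpy as np

def attention_mil_pool(slice_feats, V, w):
    """Pool per-slice features into a single scan embedding via learned attention.

    slice_feats: (num_slices, feat_dim) features, one row per CT slice.
    V: (feat_dim, hidden_dim) and w: (hidden_dim,) -- attention parameters.
    Returns (attention_weights, pooled_embedding).
    """
    scores = np.tanh(slice_feats @ V) @ w    # one relevance score per slice
    a = np.exp(scores - scores.max())        # numerically stable softmax
    a /= a.sum()                             # weights over slices sum to 1
    return a, a @ slice_feats                # weighted average of slice features
```

The learned weights concentrate on diagnostically relevant slices, which is what lets the model train from scan-level labels alone, without slice-level supervision.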
📝 Abstract
We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories -- Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma -- with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std = 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE-17/cvpr-fair-chest-ct.
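The gradient reversal mechanism the abstract describes acts as the identity in the forward pass but flips the sign of the gradient flowing back from an auxiliary gender classifier, so the backbone learns features the adversary cannot exploit. The pure-Python sketch below shows only this scalar forward/backward behavior; the class name and the `lam` strength are illustrative assumptions, and in a real PyTorch model this is typically implemented as a custom `torch.autograd.Function`.

```python
class GradientReversal:
    """Gradient Reversal Layer: identity forward, sign-flipped gradient backward.

    Placed between the scan embedding and an auxiliary gender classifier,
    it lets the adversary train normally while the backbone receives a
    reversed gradient, pushing it toward gender-uninformative features.
    """
    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (hyperparameter, assumed here)

    def forward(self, x):
        return x  # activations pass through unchanged

    def backward(self, grad_output):
        # flip the sign of the incoming gradient and scale by lambda
        return -self.lam * grad_output
```

With `lam=0.5`, a gradient of `2.0` from the adversarial head arrives at the backbone as `-1.0`, steering the representation away from gender-predictive structure.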