🤖 AI Summary
Medical imaging fine-tuning faces challenges including confounding variable interference, high annotation costs, and privacy constraints. To address these, we propose a confounder-aware data selection method that constructs a minimal yet representative subset under a fixed annotation budget while preserving the original data distribution. Our key contribution is the first integration of causal confounder identification with distance-driven greedy sampling—jointly optimizing for both causal robustness and distributional fidelity. The method comprises three components: confounder detection, metric learning for confounder-adjusted similarity, and a multimodal validation framework. Extensive evaluation across diverse medical imaging tasks demonstrates that, at equal annotation budgets, our approach achieves an average accuracy improvement of 4.2%, reduces confounding bias by 37.6%, and significantly enhances fine-tuning efficiency and out-of-distribution generalization.
📝 Abstract
The emergence of large-scale pre-trained vision foundation models has greatly advanced the medical imaging field through the pre-training and fine-tuning paradigm. However, selecting appropriate medical data for downstream fine-tuning remains a significant challenge considering its annotation cost, privacy concerns, and the detrimental effects of confounding variables. In this work, we present a confounder-aware medical data selection approach for medical dataset curation aiming to select minimal representative data by strategically mitigating the undesirable impact of confounding variables while preserving the natural distribution of the dataset. Our approach first identifies confounding variables within data and then develops a distance-based data selection strategy for confounder-aware sampling with a constrained budget in the data size. We validate the superiority of our approach through extensive experiments across diverse medical imaging modalities, highlighting its effectiveness in addressing the substantial impact of confounding variables and enhancing the fine-tuning efficiency in the medical imaging domain, compared to other data selection approaches.