🤖 AI Summary
Deep learning models for medical imaging often lose performance in new clinical settings because of distribution shifts, such as changes in imaging devices, patient populations, or acquisition protocols. Existing stress tests rely on simple perturbations (e.g., brightness or contrast changes) that do not reflect clinically realistic variation and can overestimate robustness. This work proposes a counterfactual stress-testing framework based on causal generative models, which intervenes on attributes such as scanner type or patient sex to synthesize realistic "what-if" images that preserve anatomical identity while simulating a targeted distribution shift. Experiments on chest X-ray and mammography datasets, covering three model architectures and multiple shift scenarios, show that counterfactual stress tests are a substantially more accurate proxy for real out-of-distribution performance than classical perturbations: they capture the direction and relative magnitude of performance changes as well as the ranking of models.
📝 Abstract
Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
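The abstract judges a stress test by how well it tracks real out-of-distribution behaviour on three points: direction of the performance change, relative magnitude, and model ranking. The sketch below shows one way such agreement could be scored. It is not the authors' evaluation code: the function names, the choice of Spearman rank correlation, and all numbers are assumptions made for illustration, and the scores are toy values, not results from the paper.

```python
from statistics import mean


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def proxy_agreement(in_dist, stress, real_ood):
    """Compare a stress-test proxy against real out-of-distribution scores.

    Each argument maps a model name to one scalar metric (e.g. AUC).
    """
    models = sorted(in_dist)
    d_stress = [stress[m] - in_dist[m] for m in models]
    d_real = [real_ood[m] - in_dist[m] for m in models]
    return {
        # Fraction of models whose predicted change has the correct sign.
        "direction_agreement": mean(
            1.0 if sign(a) == sign(b) else 0.0 for a, b in zip(d_stress, d_real)
        ),
        # How far the predicted change is from the observed change.
        "mean_abs_delta_error": mean(abs(a - b) for a, b in zip(d_stress, d_real)),
        # Whether the proxy orders the models as the real shift does.
        "rank_correlation": spearman(
            [stress[m] for m in models], [real_ood[m] for m in models]
        ),
    }


# Toy scores for three hypothetical models (invented, not from the paper).
in_dist = {"model_a": 0.90, "model_b": 0.89, "model_c": 0.91}
stress = {"model_a": 0.84, "model_b": 0.87, "model_c": 0.80}
real_ood = {"model_a": 0.83, "model_b": 0.86, "model_c": 0.79}

result = proxy_agreement(in_dist, stress, real_ood)
print(result)
```

Running the same function once per stress-test method (for example, counterfactual images versus brightness or contrast perturbations) would give comparable numbers for each. With only three models the rank correlation is coarse, so a real evaluation would pool several shift scenarios.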