AI Summary
Text-guided medical image segmentation suffers from misalignment between the visual and linguistic modalities and from spurious correlations induced by confounding biases. Method: This work pioneers the integration of causal inference into this task, proposing a cross-modal decoding adaptation framework that transfers generic CLIP models to the medical domain; a causal intervention module that enables robust text-to-pixel alignment via self-annotated confounder estimation and causal feature mining; and an adversarial min-max optimization mechanism that explicitly suppresses non-causal associations. Contribution/Results: The method achieves state-of-the-art performance across multiple medical referring segmentation benchmarks, significantly improving segmentation accuracy and cross-domain generalization, and establishes a novel paradigm for interpretable, causally reliable vision-language understanding in medicine.
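The adversarial min-max idea above can be sketched in a few lines. This is not the authors' implementation; it is a minimal NumPy illustration, with a hypothetical weighting factor `lam`, of how a confounding branch's segmentation success can be penalized while the causal branch's is rewarded:

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    """Pixel-wise binary cross-entropy, averaged over the mask."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def adversarial_minmax_loss(causal_pred, confound_pred, target, lam=0.1):
    """Hypothetical min-max objective: minimizing this loss drives the
    causal branch to segment well while making confounder-driven
    predictions uninformative (their loss enters with a negative sign)."""
    return bce(causal_pred, target) - lam * bce(confound_pred, target)
```

Under this toy objective, a model whose causal branch fits the mask while its confounding branch stays near chance scores lower than one relying on confounders, which is the direction the paper's optimization pushes.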
Abstract
Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method that achieves text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
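The text-to-pixel alignment described above can be illustrated with a minimal sketch. This is not the paper's decoder; it is a NumPy toy, with hypothetical feature shapes and a CLIP-style temperature, showing how a sentence embedding can score every pixel of a visual feature map via cosine similarity:

```python
import numpy as np

def text_to_pixel_alignment(pixel_feats, text_emb, temperature=0.07):
    """Score each pixel against a text embedding via cosine similarity.

    pixel_feats: (H, W, D) per-pixel visual features from an image encoder
    text_emb:    (D,) sentence embedding from a text encoder
    Returns an (H, W) soft alignment map with values in (0, 1).
    """
    # L2-normalize both modalities so the dot product is cosine similarity
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = p @ t                                   # (H, W) cosine similarities
    return 1.0 / (1.0 + np.exp(-sim / temperature))  # temperature-scaled sigmoid
```

Thresholding the returned map would give a binary mask for the region the text refers to; in the actual framework this alignment is learned end-to-end rather than computed from frozen features.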