AI Summary
Text-guided medical image segmentation suffers from misalignment between the visual and linguistic modalities and from spurious correlations induced by confounding biases. Method: This work pioneers the integration of causal inference into this task, proposing a cross-modal decoding adaptation framework that transfers generic CLIP models to the medical domain; a causal intervention module that enables robust text-to-pixel alignment via self-annotated confounder estimation and causal feature mining; and an adversarial min-max optimization mechanism that explicitly suppresses non-causal associations. Contribution/Results: The method achieves state-of-the-art performance across multiple medical referring segmentation benchmarks, significantly improving segmentation accuracy and cross-domain generalization, and establishes a novel paradigm for interpretable, causally reliable vision-language understanding in medicine.
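The adversarial min-max idea above can be sketched in a few lines. This is not the authors' implementation; it is a minimal NumPy illustration, with a hypothetical weighting factor `lam`, of how a confounding branch's segmentation success can be penalized while the causal branch's is rewarded:

```python
import numpy as np

def bce(pred, target, eps=1e-8):
    """Pixel-wise binary cross-entropy, averaged over the mask."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def adversarial_minmax_loss(causal_pred, confound_pred, target, lam=0.1):
    """Hypothetical min-max objective: minimizing this loss drives the
    causal branch to segment well while making confounder-driven
    predictions uninformative (their loss enters with a negative sign)."""
    return bce(causal_pred, target) - lam * bce(confound_pred, target)
```

Under this toy objective, a model whose causal branch fits the mask while its confounding branch stays near chance scores lower than one relying on confounders, which is the direction the paper's optimization pushes.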
Abstract
Referring medical image segmentation aims to delineate lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we transfer its rich semantic space to the medical domain through a tailored cross-modal decoding method that achieves text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module that self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
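The text-to-pixel alignment described above can be illustrated with a minimal sketch. This is not the paper's decoder; it is a NumPy toy, with hypothetical feature shapes and a CLIP-style temperature, showing how a sentence embedding can score every pixel of a visual feature map via cosine similarity:

```python
import numpy as np

def text_to_pixel_alignment(pixel_feats, text_emb, temperature=0.07):
    """Score each pixel against a text embedding via cosine similarity.

    pixel_feats: (H, W, D) per-pixel visual features from an image encoder
    text_emb:    (D,) sentence embedding from a text encoder
    Returns an (H, W) soft alignment map with values in (0, 1).
    """
    # L2-normalize both modalities so the dot product is cosine similarity
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = p @ t                                   # (H, W) cosine similarities
    return 1.0 / (1.0 + np.exp(-sim / temperature))  # temperature-scaled sigmoid
```

Thresholding the returned map would give a binary mask for the region the text refers to; in the actual framework this alignment is learned end-to-end rather than computed from frozen features.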