Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

📅 2025-08-06
🤖 AI Summary
Medical image segmentation suffers from severe domain generalization challenges due to domain shifts induced by confounding factors—such as scanner heterogeneity and imaging artifacts—leading to substantial performance degradation of vision-language models (e.g., CLIP) across domains. To address this, we propose the first multimodal domain generalization framework integrating causal inference with vision-language modeling. Our method constructs an interpretable confounder dictionary via text prompts and employs a causal intervention network to disentangle anatomical representations from domain-specific biases. Crucially, we introduce counterfactual reasoning into CLIP-driven segmentation for the first time, explicitly suppressing non-anatomical confounders—including scanner type and imaging modality. Evaluated on multiple cross-domain medical imaging benchmarks, our approach achieves state-of-the-art zero-shot segmentation performance, significantly outperforming existing methods while ensuring robustness and interpretability.

📝 Abstract
Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
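The two-step recipe in the abstract can be sketched schematically. This is an illustrative toy, not the paper's implementation: random unit vectors stand in for CLIP text embeddings of domain-describing prompts, and a simple subspace projection stands in for the learned causal intervention network that suppresses confounder directions while preserving the rest of the feature.

```python
import numpy as np

def build_confounder_dictionary(prompt_embeddings):
    """Stack text embeddings of domain-describing prompts (e.g. "an MRI scan",
    "an image with motion artifacts") into a dictionary matrix.
    In the paper these come from CLIP's text encoder; here they are given."""
    D = np.stack([e / np.linalg.norm(e) for e in prompt_embeddings])
    return D  # shape: (num_confounders, dim)

def causal_intervention(image_feature, dictionary):
    """Remove the component of the image feature lying in the span of the
    confounder dictionary -- a projection-based stand-in for the paper's
    learned intervention network."""
    f = image_feature.astype(float)
    # Orthonormal basis of the confounder subspace via QR decomposition.
    Q, _ = np.linalg.qr(dictionary.T)
    # Subtract the projection onto the confounder subspace.
    return f - Q @ (Q.T @ f)

rng = np.random.default_rng(0)
D = build_confounder_dictionary([rng.standard_normal(512) for _ in range(4)])
f_adj = causal_intervention(rng.standard_normal(512), D)
# f_adj now has (numerically) zero correlation with every dictionary entry.
```

In this simplified form the "intervention" is just deconfounding by orthogonal projection; the actual MCDRL network learns which directions to suppress so that anatomical structure is preserved.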
Problem

Research questions and friction points this paper is trying to address.

Address domain shifts in medical image segmentation
Eliminate confounders' impact on segmentation accuracy
Enhance generalization across unseen medical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages CLIP for lesion region identification
Constructs confounder dictionary via text prompts
Trains causal network to eliminate domain variations