Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation

πŸ“… 2025-09-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In medical image segmentation, multimodal models suffer from insufficient generalization due to semantic gaps and feature dispersion between abstract textual prompts and fine-grained visual features. To address this, we propose EM-CLIP, a cross-modal alignment framework integrating Expectation-Maximization (EM) clustering with text-guided decoding. Its core innovations are: (1) dynamic EM clustering to compactly aggregate visual features into transferable, domain-invariant semantic centroids; and (2) a text-guided pixel-level decoder that leverages linguistic priors to modulate visual attention, explicitly bridging the modality-level semantic gap. Evaluated on multiple multi-center cardiac and fundus datasets, EM-CLIP consistently outperforms state-of-the-art methods, demonstrating superior robustness and generalization in cross-domain segmentation tasks.
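The EM aggregation step described above can be illustrated with a minimal sketch. This is a generic soft EM clustering of pixel features into a few semantic centroids, not the paper's actual EM-CLIP module; the function name, the temperature `tau`, and the cosine-similarity E-step are all illustrative assumptions.

```python
import numpy as np

def em_aggregate(feats, k=8, iters=3, tau=0.05, seed=0):
    """Hypothetical sketch of EM-style semantic aggregation:
    soft-cluster N pixel features into k compact centroids.

    feats: (N, D) array of L2-normalized visual features.
    Returns (centroids of shape (k, D), responsibilities of shape (N, k)).
    """
    rng = np.random.default_rng(seed)
    n, _ = feats.shape
    # Initialize centroids from randomly chosen features.
    mu = feats[rng.choice(n, size=k, replace=False)]
    resp = np.full((n, k), 1.0 / k)
    for _ in range(iters):
        # E-step: soft-assign each feature to centroids via
        # temperature-scaled cosine similarity.
        logits = feats @ mu.T / tau
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: recompute centroids as responsibility-weighted
        # means, then renormalize to the unit sphere.
        mu = resp.T @ feats
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-8
    return mu, resp
```

The intuition is that the centroids act as a compact, low-dimensional summary of the dispersed pixel features, which is easier to align with text embeddings than the raw feature map.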

πŸ“ Abstract
Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model's generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Bridging semantic gap between text and medical images
Addressing feature dispersion in multimodal medical segmentation
Improving generalization of medical image segmentation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expectation-Maximization Aggregation mechanism clustering features
Text-Guided Pixel Decoder bridging semantic gap
Leveraging domain-invariant textual knowledge guidance
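The text-guided decoding idea in the bullets above can be sketched as simple cross-attention, where a class's text embedding serves as the query over pixel features and the resulting relevance map modulates the visual representation. This is an assumed, minimal formulation, not the paper's exact decoder; all names and shapes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_decode(pixel_feats, text_emb):
    """Hypothetical sketch of a text-guided pixel decoder.

    pixel_feats: (N, D) flattened visual features.
    text_emb: (C, D) one embedding per class prompt.
    Returns (attn of shape (C, N), modulated features of shape (N, D)).
    """
    d = pixel_feats.shape[1]
    # Cross-attention: text embeddings query the pixel features,
    # producing one per-pixel relevance map per class prompt.
    attn = softmax(text_emb @ pixel_feats.T / np.sqrt(d), axis=-1)
    # Gate each pixel by its total relevance to the text prompts,
    # so domain-invariant linguistic priors steer the visual features.
    gate = attn.sum(axis=0, keepdims=True).T  # (N, 1)
    return attn, pixel_feats * gate
```

In this reading, the per-class attention maps already resemble coarse segmentation masks, and the gated features feed the remaining decoder layers.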
Authors

Wenjun Yu (Shanghai University of International Business and Economics), Yinchen Zhou, Jia-Xuan Jiang, Shubin Zeng, Yuee Li, Zhong Wang