🤖 AI Summary
Current medical vision-language pretraining methods employ uniform local feature extraction, overlooking intrinsic differences across modalities—such as X-ray, MRI, and CT—in resolution, structural characteristics, and diagnostic semantics, thereby limiting cross-modal alignment performance. To address this, we propose a modality-agnostic, dynamically routed Mixture-of-Experts (MoE) framework that selects multi-scale feature processing paths conditioned solely on report text semantics—without requiring explicit modality labels. We further introduce a diagnostic-context-driven mechanism for specializing visual representations, integrating a Swin Transformer backbone, conditional MoE routing, a multi-scale feature pyramid, and spatially adaptive attention. Our method achieves significant improvements on multiple medical image–text alignment and retrieval benchmarks, notably enhancing generalization across heterogeneous modalities and fine-grained diagnostic consistency.
📝 Abstract
Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking these modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representations based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework produces localized visual representations aligned with textual descriptions, without requiring modality-specific supervision at inference. Empirical results on diverse medical benchmarks demonstrate that MedMoE improves alignment and retrieval performance across imaging modalities, underscoring the value of modality-specialized visual representations in clinical vision-language systems.
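To make the routing idea concrete, here is a minimal NumPy sketch of text-conditioned soft MoE gating over multi-scale features. This is not the paper's implementation: all dimensions, the linear "experts" (standing in for the specialized branches over Swin feature-pyramid scales), and the single-layer gate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_TXT, D_VIS, K = 16, 32, 3  # hypothetical dims: text emb, visual emb, number of experts

# Gating network: maps a report-text embedding to a distribution over experts
# (the paper conditions routing on the report, not on explicit modality labels).
W_gate = rng.normal(scale=0.1, size=(D_TXT, K))

# Each "expert" here is a single linear map standing in for a specialized
# branch over one pyramid scale (e.g., coarse / mid / fine).
W_experts = [rng.normal(scale=0.1, size=(D_VIS, D_VIS)) for _ in range(K)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def medmoe_route(text_emb, pyramid_feats):
    """Fuse per-scale expert outputs, weighted by a text-conditioned gate."""
    gate = softmax(text_emb @ W_gate)                         # (K,) expert weights
    outs = [f @ W for f, W in zip(pyramid_feats, W_experts)]  # per-expert outputs
    fused = sum(g * o for g, o in zip(gate, outs))            # soft mixture
    return fused, gate

text_emb = rng.normal(size=D_TXT)                  # pooled report embedding
pyramid = [rng.normal(size=D_VIS) for _ in range(K)]  # pooled feature per scale
fused, gate = medmoe_route(text_emb, pyramid)
```

A real system would replace the linear experts with the modality-specialized branches and use hard or top-k routing as appropriate; the sketch only shows how report semantics can select among multi-scale processing paths without a modality label.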