MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

📅 2025-10-30

📈 Citations: 0

✨ Influential: 0

career value

206K/year

🤖 AI Summary

This work addresses two key limitations in medical image segmentation: inadequate multi-scale feature modeling and insufficient cross-modal semantic guidance. To this end, we propose the first dynamic Mixture-of-Experts (MoE) framework integrated with vision-language models. Methodologically, we design a multi-scale visual encoder jointly routed with clinical text embeddings, enabling input-adaptive expert selection. Crucially, we introduce clinical textual descriptions into the MoE architecture for medical segmentation—marking the first such integration—and build an end-to-end vision-language collaborative segmentation model. Evaluated on ten public medical datasets comprising 3,410 CT scans, our approach achieves significant improvements in segmentation accuracy and cross-dataset generalization. The results establish a novel paradigm for deploying vision-language foundation models in precision medical imaging analysis.

Technology Category

Application Category

📝 Abstract

In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.

Problem

Research questions and friction points this paper is trying to address.

Dynamic expert selection for medical image segmentation

Integrating vision-language models with multi-scale features

Leveraging textual embeddings to enhance segmentation precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts adapted for medical vision-language tasks

Dynamic expert selection using multi-scale visual features

Integration of textual embeddings with medical imaging data

🔎 Similar Papers

Multi-modal vision-language model for generalizable annotation-free pathology localization and clinical diagnosis