MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations in medical image segmentation: inadequate multi-scale feature modeling and insufficient cross-modal semantic guidance. To this end, the authors propose what they describe as the first dynamic Mixture-of-Experts (MoE) framework integrated with vision-language models for this task. Methodologically, a multi-scale visual encoder is routed jointly with clinical text embeddings, enabling input-adaptive expert selection; introducing clinical textual descriptions into the MoE architecture yields an end-to-end vision-language collaborative segmentation model. Evaluated on ten public medical datasets comprising 3,410 CT scans, the approach achieves notable improvements in segmentation accuracy and cross-dataset generalization, suggesting a new paradigm for deploying vision-language foundation models in precision medical imaging analysis.

📝 Abstract
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for medical image segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), to medical vision-language tasks. The architecture enables dynamic expert selection by utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings, in a novel integration of vision-language models for this domain. Using an assembly of 10 datasets encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach brings foundation models to medical imaging, benefiting from the established efficacy of MoE in boosting model performance through the incorporation of textual information. Demonstrating competitive precision across multiple datasets, MoME offers a novel architecture for achieving robust results in medical image analysis.
Problem

Research questions and friction points this paper is trying to address.

Dynamic expert selection for medical image segmentation
Integrating vision-language models with multi-scale features
Leveraging textual embeddings to enhance segmentation precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts adapted for medical vision-language tasks
Dynamic expert selection using multi-scale visual features
Integration of textual embeddings with medical imaging data
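The core routing idea listed above, expert selection conditioned jointly on visual features and text embeddings, can be sketched as a minimal PyTorch module. This is an illustrative toy, not the paper's implementation: the class name `TextRoutedMoE`, the feature dimensions, the per-expert MLPs, and the top-k softmax gate are all assumptions made for the sketch.

```python
# Minimal sketch of text-conditioned MoE routing (assumed design, not the
# paper's actual architecture): the gate sees both a visual feature vector
# and a clinical text embedding, so expert selection adapts to the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextRoutedMoE(nn.Module):
    def __init__(self, vis_dim=64, txt_dim=32, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Small MLPs stand in for heavier segmentation experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.GELU(),
                          nn.Linear(vis_dim, vis_dim))
            for _ in range(num_experts)
        )
        # Gate routes on the concatenated visual + text embedding.
        self.gate = nn.Linear(vis_dim + txt_dim, num_experts)

    def forward(self, vis_feat, txt_emb):
        # vis_feat: (B, vis_dim), txt_emb: (B, txt_dim)
        logits = self.gate(torch.cat([vis_feat, txt_emb], dim=-1))
        weights = F.softmax(logits, dim=-1)            # (B, num_experts)
        topw, topi = weights.topk(self.top_k, dim=-1)  # keep top-k experts
        topw = topw / topw.sum(dim=-1, keepdim=True)   # renormalize weights
        out = torch.zeros_like(vis_feat)
        for slot in range(self.top_k):
            idx = topi[:, slot]                        # chosen expert per sample
            w = topw[:, slot].unsqueeze(-1)            # (B, 1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                         # run only routed samples
                    out[mask] += w[mask] * expert(vis_feat[mask])
        return out

moe = TextRoutedMoE()
y = moe(torch.randn(3, 64), torch.randn(3, 32))
print(y.shape)  # torch.Size([3, 64])
```

In this sketch each sample activates only its top-k experts, which is what makes MoE computation input-adaptive; conditioning the gate on the text embedding is the piece that lets a clinical description steer which experts process the image features.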
Arghavan Rezvani
PhD student, University of California, Irvine
Deep Learning, Healthcare
Xiangyi Yan
Department of Computer Science, University of California, Irvine
Anthony T. Wu
Department of Computer Science, University of California, Irvine; School of Medicine, University of California, Irvine
Kun Han
Department of Computer Science, University of California, Irvine
Pooya Khosravi
Department of Computer Science, University of California, Irvine; School of Medicine, University of California, Irvine
Xiaohui Xie
Department of Computer Science, University of California, Irvine