🤖 AI Summary
This work addresses two key limitations in medical image segmentation: inadequate multi-scale feature modeling and insufficient cross-modal semantic guidance. To this end, we propose the first dynamic Mixture-of-Experts (MoE) framework integrated with vision-language models. Methodologically, we design a multi-scale visual encoder jointly routed with clinical text embeddings, enabling input-adaptive expert selection. Crucially, we introduce clinical textual descriptions into the MoE architecture for medical segmentation—marking the first such integration—and build an end-to-end vision-language collaborative segmentation model. Evaluated on ten public medical datasets comprising 3,410 CT scans, our approach achieves significant improvements in segmentation accuracy and cross-dataset generalization. The results establish a novel paradigm for deploying vision-language foundation models in precision medical imaging analysis.
📝 Abstract
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.