Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses gradient conflict and representation collapse in the self-supervised training of multi-modality medical vision foundation models, both of which arise because feature statistics differ across imaging modalities, that is, the data are not independent and identically distributed (Non-IID). To mitigate this, the authors propose Director-Experts (DEX), a modular architecture that, for the first time, formulates emergent modularity as a dynamic equilibrium between specialization and collaboration. DEX combines an image-level expert activation strategy with a Director mechanism updated by a group exponential moving average, which together encourage modular representations to emerge. Pretrained on the Medical Vision Universe dataset, which covers over 4 million images across ten imaging modalities, DEX shows improved optimization behavior and transfer performance across 26 downstream tasks.

📝 Abstract

Multi-modality medical vision (MV) foundation models (FMs) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts and a director. The experts are dynamically adapted by our image-wise activation strategy and autonomously specialize in modality-dominant statistics. The director is updated via our group exponential moving average and distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, comprising over 4 million images across 10 modalities, which gives DEX FM-level pre-training with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at https://github.com/YutingHe-list/DEX.
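
The abstract names two mechanisms, image-wise expert activation and a director updated by a group exponential moving average, but this page does not give their formulas. The sketch below is only one plausible reading, not the authors' implementation. The function names `route_image` and `group_ema_update`, the top-k routing rule, the momentum value, and the choice to average over the activated experts are all assumptions made for illustration.

```python
# Hypothetical sketch: none of these names or rules come from the DEX paper.

def route_image(gate_scores, k=1):
    """Pick the top-k experts once for a whole image (assumed routing rule).

    gate_scores holds one score per expert for a single image, so every
    part of that image is handled by the same experts.
    """
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]


def group_ema_update(director, experts, active, momentum=0.99):
    """Move the director's weights toward the mean of the active experts.

    Assumed update: director <- m * director + (1 - m) * mean(active experts).
    Weights are flat lists of floats to keep the example dependency-free.
    """
    if not active:
        return list(director)
    updated = []
    for j, d in enumerate(director):
        group_mean = sum(experts[i][j] for i in active) / len(active)
        updated.append(momentum * d + (1.0 - momentum) * group_mean)
    return updated


# Toy usage: three experts with two weights each, one image.
experts = [[1.0, 1.0], [3.0, 5.0], [-2.0, 0.0]]
director = [0.0, 0.0]
active = route_image([0.1, 0.7, 0.2], k=2)
director = group_ema_update(director, experts, active, momentum=0.5)
print(active, director)
```

In this reading the experts receive gradients and specialize, while the director is never trained directly: it only tracks a smoothed average of the experts that were activated, which is one way a shared space could integrate knowledge across modalities.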

Problem

Research questions and friction points this paper is trying to address.

multi-modality

medical vision

Non-IID

representation collapse

modular representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

modular representation

multi-modality medical vision

Director-Experts