Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine

📅 2024-05-30
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current biomedical multimodal models face critical bottlenecks: reliance on end-to-end training, quadratic scaling of computational cost with the number of modalities, severe performance degradation under extreme modality imbalance, and rigid topological coupling. To address these, we propose MM-Lego, a tuning-free, general-purpose multimodal fusion framework. It introduces a frequency-domain feature harmonisation mechanism that achieves shape alignment and low-interference merging of arbitrary unimodal encoders. We further design modality-agnostic wrappers and zero-/few-shot model merging strategies, enabling topology-agnostic fusion and robust modelling under highly imbalanced modalities. Crucially, MM-Lego requires no fine-tuning yet matches or surpasses end-to-end models in performance, while remaining compatible with any unimodal encoder. Evaluated across seven biomedical benchmark datasets, it achieves state-of-the-art results on five, demonstrating strong flexibility, efficiency, and generalisability in biomedical multimodal learning.

📝 Abstract
Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We achieve this by introducing a wrapper for any unimodal encoder that enforces shape consistency between modality representations. It harmonises these representations by learning features in the frequency domain to enable model merging with little signal interference. We show that MM-Lego 1) can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, surpasses all benchmarks in five out of seven datasets.

Problem

Research questions and friction points this paper is trying to address.

Handling diverse data modalities in biomedical machine learning
Overcoming limitations of existing multimodal fusion approaches
Enabling flexible model merging with minimal fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wrapper enforces shape consistency for encoders
Learns features in frequency domain
Enables model merging with minimal fine-tuning
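The core idea behind these contributions can be illustrated with a toy sketch: wrap each unimodal representation to a common shape, move the wrapped vectors into the frequency domain, average their spectra, and invert back. This is not the paper's actual MM-Lego implementation; the function names (`wrap`, `harmonise_and_merge`), the fixed target dimension, and the use of a plain spectral mean are illustrative assumptions.

```python
import numpy as np

def wrap(rep, dim=64):
    """Hypothetical wrapper: flatten a unimodal representation and
    truncate or zero-pad it to a fixed length so all modalities share
    one shape."""
    v = np.asarray(rep, dtype=float).ravel()
    if v.size >= dim:
        return v[:dim]
    return np.pad(v, (0, dim - v.size))

def harmonise_and_merge(reps, dim=64):
    """Toy frequency-domain merge: FFT each wrapped representation,
    average the spectra (assumed stand-in for the paper's harmonisation),
    and invert back to feature space."""
    spectra = [np.fft.rfft(wrap(r, dim)) for r in reps]
    merged_spectrum = np.mean(spectra, axis=0)
    return np.fft.irfft(merged_spectrum, n=dim)

# Example: merge two representations of different sizes, e.g. one from
# an imaging encoder and one from a tabular encoder (shapes are made up).
image_rep = np.linspace(0.0, 1.0, 80)
tabular_rep = np.ones(10)
fused = harmonise_and_merge([image_rep, tabular_rep], dim=64)
```

Because each encoder is only touched through its output representation, this style of merging needs no gradient flow through the encoders, which is what makes a tuning-free or minimally fine-tuned setup possible.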