A Closer Look at Multimodal Representation Collapse

πŸ“… 2025-05-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper addresses modality collapse in multimodal fusion, a failure mode in which models come to neglect certain modalities during training. We identify its root cause: noisy features from one modality become entangled, via shared neurons, with predictive features from another, and rank deficiency in the fusion head compounds the problem, degrading modality-specific representations. We provide the first theoretical explanation grounded jointly in representation entanglement and low-rank constraints. Building on this insight, we propose an explicit basis reallocation algorithm that enforces cross-modal disentanglement and dynamic weight calibration while remaining robust at inference time when modalities are missing. Our method integrates cross-modal knowledge distillation, neuron-level attribution, and rank-constrained modeling of the fusion head. Evaluated on multiple benchmarks, it significantly mitigates modality collapse and substantially improves generalization and stability under partial modality absence.

πŸ“ Abstract
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
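To make the rank-bottleneck intuition from the abstract concrete, here is a minimal NumPy sketch (an illustration, not the paper's algorithm or data): a rank-1 fusion head forces both modalities through a single shared direction, so a high-variance noisy modality can dominate the fused output and mask the predictive modality's contribution. All dimensions and the ablation-style diagnostic are assumptions chosen for the toy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: modality A carries predictive features, modality B is
# high-variance noise. A rank-1 fusion head must route both through
# one shared direction -- the entanglement the paper describes.
d_a, d_b, n = 4, 4, 256
x_a = rng.normal(size=(n, d_a))          # predictive features
x_b = 5.0 * rng.normal(size=(n, d_b))    # noisy, high-variance features
x = np.concatenate([x_a, x_b], axis=1)

# Hypothetical rank-1 fusion head: W = u v^T (output dim 3).
u = rng.normal(size=(3,))
v = rng.normal(size=(d_a + d_b,))
W = np.outer(u, v)

def contribution(W, x, keep):
    """Output energy when only the `keep` input columns are active
    (a simple ablation-style diagnostic for modality collapse)."""
    mask = np.zeros(x.shape[1])
    mask[list(keep)] = 1.0
    return float(np.linalg.norm((x * mask) @ W.T))

c_a = contribution(W, x, range(d_a))
c_b = contribution(W, x, range(d_a, d_a + d_b))
print(f"modality A share of output energy: {c_a / (c_a + c_b):.2f}")
```

Because the head is rank deficient, the noisy modality's larger variance captures most of the output energy along the shared direction; a higher-rank head (or the paper's basis reallocation) would let each modality occupy its own subspace.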
Problem

Research questions and friction points this paper is trying to address.

Understanding modality collapse in multimodal fusion models
Analyzing noisy feature entanglement causing modality collapse
Proposing cross-modal distillation to prevent modality collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal distillation disentangles noisy features
Basis reallocation prevents modality collapse
Frees rank bottlenecks in student encoder