🤖 AI Summary
Multimodal fusion in sensor-rich tasks—such as human action prediction—commonly suffers from two intertwined degradation phenomena: *feature collapse* (loss of discriminative power in individual feature dimensions) and *modality collapse* (dominant modalities suppressing others). To address both simultaneously, this paper introduces the first unified optimization framework grounded in *effective rank*, a spectral measure that jointly quantifies both collapse types. We propose the Rank-enhancing Token Fuser architecture and a cross-modal effective-rank mutual enhancement strategy to achieve selective and balanced fusion. Theoretical analysis establishes the framework’s ability to jointly mitigate both collapses. Extensive experiments on NTU-RGBD, UT-Kinect, and DARai demonstrate state-of-the-art performance in action anticipation, with absolute accuracy gains of up to 3.74% over prior methods.
📝 Abstract
Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse, where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse, where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately, because there is no unifying framework that efficiently addresses both in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can quantify and counter both representation collapses. We propose *Rank-enhancing Token Fuser*, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other's effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present `R3D`, a depth-informed fusion framework. Extensive experiments on NTU-RGBD, UT-Kinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74%. Our code is available at: [https://github.com/olivesgatech/R3D](https://github.com/olivesgatech/R3D).
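To make the central measure concrete: effective rank is commonly defined (following Roy & Vetterli, 2007) as the exponential of the Shannon entropy of a matrix's normalized singular-value distribution, so a balanced spectrum scores high and a collapsed one scores near 1. The sketch below is an illustrative implementation of that standard definition, not code from the paper's `R3D` repository:

```python
import numpy as np

def effective_rank(X: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)   # normalize the spectrum into a distribution
    p = p[p > eps]            # drop numerically-zero entries
    return float(np.exp(-np.sum(p * np.log(p))))

# A balanced, full-rank representation has a high effective rank...
print(effective_rank(np.eye(5)))        # ≈ 5.0
# ...while a collapsed (rank-1) representation scores ≈ 1.0.
print(effective_rank(np.ones((5, 5))))  # ≈ 1.0
```

Under this measure, both collapse types show up the same way: feature collapse shrinks the entropy of the eigenspectrum within a modality, and modality collapse skews the spectrum of the fused representation toward the dominant modality's subspace.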