MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

📅 2026-05-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation that multimodal models suffer in domain generalization when the training and testing distributions differ, and in particular the fusion overfitting induced by spurious cross-modal co-occurrence statistics. To mitigate this, the authors propose a modality entropy regularization mechanism that maximizes the entropy of each modality encoder's feature distribution, preserving feature diversity and reducing over-reliance on source-domain-specific cross-modal correlations. The regularizer is architecture-agnostic and plug-and-play: it is added as a loss term and requires no modification to the backbone network. On the EPIC-Kitchens and HAC benchmarks, the method improves average performance by approximately 5% over standard fusion and by about 2% over the current state of the art. The work identifies fusion overfitting as a failure mode of jointly optimized multimodal fusion and offers an effective mitigation.
📝 Abstract
Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder's feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.
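The abstract describes MER-DG as an additive loss term that maximizes the entropy of each encoder's feature distribution, but it does not say how that entropy is estimated. The sketch below is only an illustration under stated assumptions: it uses a diagonal-Gaussian entropy estimate as a stand-in, and the names `gaussian_entropy`, `regularized_loss`, and the `weight` parameter are hypothetical, not taken from the paper.

```python
import math
from typing import Dict, Sequence

Batch = Sequence[Sequence[float]]  # rows are samples, columns are feature dimensions


def gaussian_entropy(features: Batch, eps: float = 1e-6) -> float:
    """Diagonal-Gaussian entropy estimate (in nats) for a batch of feature vectors.

    ASSUMPTION: the paper's estimator is not given in the abstract; this proxy
    sums 0.5 * log(2 * pi * e * variance) over feature dimensions.
    """
    n = len(features)
    d = len(features[0])
    total = 0.0
    for j in range(d):
        column = [row[j] for row in features]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        # eps keeps the logarithm finite when a dimension has collapsed to a constant.
        total += 0.5 * math.log(2.0 * math.pi * math.e * (var + eps))
    return total


def regularized_loss(task_loss: float,
                     modality_features: Dict[str, Batch],
                     weight: float = 0.1) -> float:
    """Task loss minus a weighted sum of per-modality entropies.

    Subtracting the entropy means that minimizing this loss maximizes entropy.
    `weight` is a hypothetical trade-off hyperparameter.
    """
    entropy_sum = sum(gaussian_entropy(f) for f in modality_features.values())
    return task_loss - weight * entropy_sum


# Toy batches: one encoder output has collapsed, the other is spread out.
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
diverse = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

loss_collapsed = regularized_loss(1.0, {"video": collapsed, "audio": collapsed})
loss_diverse = regularized_loss(1.0, {"video": diverse, "audio": diverse})
```

With the same task loss, the batch of diverse features yields a lower regularized loss than the collapsed batch, which is the pressure toward feature diversity that the abstract describes. In a real training loop the same term would be computed on differentiable encoder outputs inside an autodiff framework; that part is omitted here.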
Problem

Research questions and friction points this paper is trying to address.

multimodal domain generalization
fusion overfitting
domain-invariant features
cross-modal co-occurrences
modality encoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Entropy Regularization
Multimodal Domain Generalization
Fusion Overfitting
Feature Diversity
Domain-Invariant Representation