Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

In multimodal learning, aligning marginal distributions across modalities often causes over-regularization, degrading intra-modal feature expressiveness. To address this, we propose the first Variational Dirichlet Process (VDP)-based multimodal fusion framework, modeling each modality as a Gaussian mixture with adaptively determined component count. Leveraging the “richer-gets-richer” prior of the Dirichlet process, our method dynamically amplifies salient features while suppressing noise components—enabling joint optimization of intra-modal representation preservation and cross-modal interaction without stringent alignment constraints. This work pioneers the integration of Dirichlet processes into multimodal learning, achieving automatic balancing between feature selection and fusion. Extensive experiments demonstrate significant improvements over state-of-the-art methods on multiple benchmark datasets. Ablation studies confirm robustness to hyperparameters and validate that explicit distributional modeling substantially enhances cross-modal semantic consistency.

Technology Category

Application Category

📝 Abstract

Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors. Ablation analysis further validates the effectiveness of DP in aligning modality distributions and its robustness to changes in key hyperparameters. Code is anonymously available at https://github.com/HKU-MedAI/DPMM.git

Problem

Research questions and friction points this paper is trying to address.

Balancing intra-modal representation learning with cross-modal alignment

Amplifying prominent features using variational Dirichlet process properties

Addressing over-regularization in multimodal fusion while preserving feature expressiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Variational Dirichlet Process for multimodal feature fusion

Dynamically allocates weights to prominent intra-modal representations

Balances intra-modal learning with cross-modal alignment automatically

🔎 Similar Papers

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification