MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

📅 2025-08-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Transformer-based multimodal fusion methods model cross-modal correlations only implicitly, without explicit characterization of modality-specific features and the complex structural relationships among modalities. To address this, the paper proposes MANGO, a multimodal attention fusion framework grounded in invertible normalizing flows. MANGO introduces a new Invertible Cross-Attention (ICA) layer and designs three cross-attention mechanisms, Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA), to enable explicit, interpretable, and tractable fusion. By combining normalizing flows, invertible neural networks, and multimodal representation learning, MANGO forms an end-to-end differentiable architecture. Evaluated on semantic segmentation, image-to-image translation, and movie genre classification, it achieves state-of-the-art performance and scales to high-dimensional multimodal data.

📝 Abstract
Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach (the source code of this work will be publicly available) to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
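The paper's details of the ICA layer are not given here, but the core idea, an invertible transform whose parameters come from cross-attention over another modality, can be illustrated with a RealNVP-style affine coupling sketch: the conditioning modality drives an elementwise scale and shift through attention, so the map stays exactly invertible with a tractable Jacobian log-determinant regardless of how complex the attention network is. The following minimal NumPy sketch rests on that assumption; the names `cross_attention` and `InvertibleCrossAttentionCoupling` are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_src, kv_src, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: queries from one modality,
    keys/values from the conditioning modality."""
    Q, K, V = q_src @ Wq, kv_src @ Wk, kv_src @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

class InvertibleCrossAttentionCoupling:
    """Affine coupling layer whose scale/shift are produced by
    cross-attention over a conditioning modality. Because x_b enters
    only through an elementwise affine map, the transform is exactly
    invertible no matter what the attention network computes."""
    def __init__(self, dim, rng):
        self.Wq = rng.normal(scale=0.1, size=(dim, dim))
        self.Wk = rng.normal(scale=0.1, size=(dim, dim))
        self.Wv = rng.normal(scale=0.1, size=(dim, 2 * dim))

    def _params(self, x_a, cond):
        h = cross_attention(x_a, cond, self.Wq, self.Wk, self.Wv)
        log_s, t = np.split(h, 2, axis=-1)
        return np.tanh(log_s), t  # bounded log-scale for stability

    def forward(self, x_a, x_b, cond):
        log_s, t = self._params(x_a, cond)
        y_b = x_b * np.exp(log_s) + t
        log_det = log_s.sum()     # tractable Jacobian log-determinant
        return x_a, y_b, log_det

    def inverse(self, x_a, y_b, cond):
        log_s, t = self._params(x_a, cond)
        return x_a, (y_b - t) * np.exp(-log_s)

# Round-trip check: inverse(forward(x)) recovers x_b exactly.
rng = np.random.default_rng(0)
layer = InvertibleCrossAttentionCoupling(4, rng)
x_a = rng.normal(size=(3, 4))
x_b = rng.normal(size=(3, 4))
cond = rng.normal(size=(5, 4))      # conditioning modality tokens
_, y_b, log_det = layer.forward(x_a, x_b, cond)
_, x_b_rec = layer.inverse(x_a, y_b, cond)
print(np.allclose(x_b, x_b_rec))
```

Exact invertibility plus a cheap log-determinant is what makes such a layer usable inside a normalizing flow, where the data log-likelihood decomposes into the base density plus the summed log-determinants of all layers.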
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal fusion through explicit, interpretable feature learning
Captures complex correlations in multimodal data via novel cross-attention mechanisms
Enables scalable processing of high-dimensional multimodal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Invertible Cross-Attention (ICA) layer for Normalizing Flow-based modeling
Three novel cross-attention mechanisms for multimodal fusion
Multimodal Attention-based Normalizing Flow for scalability