Graph4MM: Weaving Multimodal Learning with Structural Information

📅 2025-10-19
🤖 AI Summary
Real-world multimodal data exhibit complex cross-modal structural relationships, such as coreference and contextual dependencies, that extend far beyond simple image–text alignment. Existing methods treat graphs as isolated modalities and neglect multi-hop neighbor interactions, leading to fragmented semantic understanding. To address this, we propose Graph4MM, a unified framework that jointly optimizes multi-hop graph structural modeling and multimodal fusion. Specifically, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion, and we design MM-QFormer, a multi-mapping querying transformer for principled, modality-aware fusion. Evaluated on both generative and discriminative multimodal tasks, our model, despite its smaller scale, outperforms state-of-the-art vision-language models and multimodal graph models, achieving an average performance gain of 6.93%.

📝 Abstract
Real-world multimodal data usually exhibit complex structural relationships beyond traditional one-to-one mappings like image-caption pairs. Entities across modalities interact in intricate ways, with images and text forming diverse interconnections through contextual dependencies and co-references. Graphs provide powerful structural information for modeling intra-modal and inter-modal relationships. However, previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality, which fragments the overall understanding. This limitation presents two key challenges in multimodal learning: (1) integrating structural information from multi-hop neighbors into foundational models, and (2) fusing modality-specific information in a principled manner. To address these challenges, we revisit the role of graphs in multimodal learning within the era of foundation models and propose Graph4MM, a graph-based multimodal learning framework. To be specific, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion. Furthermore, we design MM-QFormer, a multi-mapping querying transformer for cross-modal fusion. Through theoretical and empirical analysis, we show that leveraging structures to integrate both intra- and inter-modal interactions improves multimodal understanding beyond treating them as a standalone modality. Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement.
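The abstract describes Hop-Diffused Attention as injecting multi-hop structural information into self-attention via causal masking and hop diffusion. The paper's exact formulation is not reproduced here, so the following is a minimal illustrative sketch of the general idea, under stated assumptions: hop distances are computed by BFS over the graph, an additive bias that decays geometrically with hop count is added to the attention logits, and node pairs with no path are masked out. The `decay` parameter and the log-linear bias form are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def hop_distances(adj: np.ndarray) -> np.ndarray:
    """All-pairs shortest hop counts via BFS; np.inf where unreachable."""
    n = adj.shape[0]
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0.0
        frontier, d = [s], 0
        while frontier:
            d += 1
            nxt = []
            for u in frontier:
                for v in np.nonzero(adj[u])[0]:
                    if np.isinf(dist[s, v]):
                        dist[s, v] = d
                        nxt.append(v)
            frontier = nxt
    return dist

def hop_diffused_attention(q, k, v, adj, decay=0.5):
    """Self-attention with a hop-dependent additive bias (illustrative sketch).

    Closer neighbors (fewer hops) get a larger bias via log(decay) * hops,
    so attention weights decay geometrically with graph distance;
    unreachable pairs are masked with a large negative value.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    hops = hop_distances(adj)
    bias = np.where(np.isinf(hops), -1e9, np.log(decay) * hops)
    logits = logits + bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

On a path graph 0–1–2 with identical queries and keys, the attention weight from node 0 onto itself, its 1-hop neighbor, and its 2-hop neighbor falls off as 1 : decay : decay², which is the intended effect: structure, not just content, shapes attention. A causal mask, as the paper also uses, would be one more additive term on the logits.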
Problem

Research questions and friction points this paper is trying to address.

Modeling complex structural relationships in multimodal data
Integrating multi-hop graph information into foundation models
Fusing modality-specific information through principled cross-modal interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hop-Diffused Attention integrates multi-hop structural information
MM-QFormer, a multi-mapping querying transformer, enables principled cross-modal fusion
Graph framework combines intra-modal and inter-modal interactions
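The MM-QFormer contribution is a querying transformer that fuses modality-specific features. The paper's "multi-mapping" details are not specified in this summary, so the sketch below shows only the generic query-transformer fusion pattern it builds on: a shared set of learnable queries cross-attends to each modality's token sequence, and the per-modality readouts are pooled into one fused representation. The pooling-by-mean step and all shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, tokens):
    """Queries read from a modality's tokens via scaled dot-product attention."""
    d = queries.shape[-1]
    w = softmax(queries @ tokens.T / np.sqrt(d))
    return w @ tokens

def qformer_fuse(queries, modality_tokens):
    """Illustrative query-based fusion: each modality is read by the same
    learnable queries, then the per-modality readouts are averaged."""
    readouts = [cross_attend(queries, toks) for toks in modality_tokens]
    return np.mean(readouts, axis=0)

num_queries, d = 4, 8
queries = rng.normal(size=(num_queries, d))      # learnable in a real model
image_tokens = rng.normal(size=(10, d))          # e.g. vision-encoder output
text_tokens = rng.normal(size=(6, d))            # e.g. text-encoder output
fused = qformer_fuse(queries, [image_tokens, text_tokens])  # (4, 8)
```

The fused output has a fixed size set by the number of queries, regardless of how many tokens each modality produces, which is what lets a querying transformer act as a compact bridge between modality encoders and a language model.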