Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal graph models struggle to explicitly model cross-modal interactions and alignment, resulting in a semantic gap between modalities. To address this limitation, this work proposes the PLANET framework, which decouples modality interaction and alignment for the first time. Specifically, fine-grained interaction is achieved through a topology-aware Embedding-wise Domain Gating (EDG) mechanism at the embedding level, while global alignment is realized via Node-wise Discretization Retrieval (NDR) in a Discretized Semantic Representation Space (DSRS) at the node level. This dual-level design significantly enhances the representational capacity and generalization performance of multimodal graph foundation models. Extensive experiments demonstrate that PLANET consistently outperforms state-of-the-art methods across various graph-centric tasks and multimodal generation benchmarks.

📝 Abstract
Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on Text-Attributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) allows for leveraging the rich multimodal information in MAGs, and extends applicability to broader types of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1) they fail to explicitly model modality interaction, essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2) they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework employing a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. At the embedding granularity, (1) Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. At the node granularity, (2) Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
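The abstract does not give formulas for EDG or NDR, so the following is only a minimal sketch of the two general ideas it names, under stated assumptions: a per-dimension sigmoid gate that mixes a node's embedding with the mean of its neighbors' embeddings from another modality, and a nearest-codeword lookup in a small shared codebook standing in for a discretized space. All function names, the gate form, and the toy values are hypothetical and are not taken from the PLANET paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_vector(vectors):
    # Element-wise mean of equally sized vectors (neighbor aggregation).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def gated_interaction(own, context, w_own=1.0, w_ctx=1.0, bias=0.0):
    # Assumed gate: g_d = sigmoid(w_own * own_d + w_ctx * ctx_d + bias),
    # output_d = g_d * own_d + (1 - g_d) * ctx_d for each dimension d.
    fused = []
    for o, c in zip(own, context):
        g = sigmoid(w_own * o + w_ctx * c + bias)
        fused.append(g * o + (1.0 - g) * c)
    return fused


def nearest_code(vector, codebook):
    # Index of the codebook entry closest in squared Euclidean distance.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Toy data (hypothetical): one node's text and image embeddings, plus the
# image embeddings of its two graph neighbors.
text_emb = [0.9, 0.1]
image_emb = [0.8, 0.0]
neighbor_image_embs = [[0.7, 0.3], [0.5, 0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0]]  # shared across modalities

context = mean_vector(neighbor_image_embs)
fused_text = gated_interaction(text_emb, context)

# Both modalities are mapped to indices in the same discrete space.
text_code = nearest_code(fused_text, codebook)
image_code = nearest_code(image_emb, codebook)
```

In this toy setup the text and image embeddings of the node map to the same codebook index, which is the sense in which a shared discrete space can bring two modalities into agreement. A learned model would train the gate parameters and the codebook rather than fixing them by hand.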
Problem

Research questions and friction points this paper is trying to address.

Multimodal Graph Foundation Model
Modality Interaction
Modality Alignment
Multimodal-Attributed Graphs
Cross-modal Semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Graph Foundation Model
Modality Interaction
Modality Alignment
Divide-and-Conquer
Graph Topology-aware
Sicheng Liu
Xunkai Li
School of Computer Science and Technology, Beijing Institute of Technology
Data-centric AI, Graph ML, AI4Science
Daohan Su
Beijing Institute of Technology
Graph Machine Learning
Ru Zhang
Hongchao Qin
Beijing Institute of Technology
Graph Data Mining
Ronghua Li
Guoren Wang
Beijing Institute of Technology