Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal graph foundation models do not explicitly model cross-modal interaction and align modalities sub-optimally, which leaves a semantic gap between modalities. To address this, the work proposes the PLANET framework, which decouples modality interaction from modality alignment and handles each at its own granularity. At the embedding level, a topology-aware Embedding-wise Domain Gating (EDG) mechanism provides fine-grained interaction. At the node level, Node-wise Discretization Retrieval (NDR) provides global alignment by constructing a Discretized Semantic Representation Space (DSRS). The authors report that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.

📝 Abstract

Graph Foundation Models (GFMs) have achieved remarkable success in generalizing across diverse domains. However, they mainly focus on Text-Attributed Graphs (TAGs), leaving Multimodal-Attributed Graphs (MAGs) largely untapped. Developing Multimodal Graph Foundation Models (MGFMs) makes it possible to leverage the rich multimodal information in MAGs and extends applicability to a broader range of downstream tasks. While recent MGFMs integrate diverse modality information, our empirical investigation reveals two fundamental limitations of existing MGFMs: (1) they fail to explicitly model modality interaction, which is essential for capturing intricate cross-modal semantics beyond simple aggregation, and (2) they exhibit sub-optimal modality alignment, which is critical for bridging the significant semantic disparity between distinct modal spaces. To address these challenges, we propose PLANET (graPh topoLogy-aware modAlity iNteraction and alignmEnT), a novel framework that employs a Divide-and-Conquer strategy to decouple modality interaction and alignment across distinct granularities. (1) At the embedding granularity, Embedding-wise Domain Gating (EDG) performs local semantic enrichment by adaptively infusing topology-aware cross-modal context, achieving modality interaction. (2) At the node granularity, Node-wise Discretization Retrieval (NDR) ensures global modality alignment by constructing a Discretized Semantic Representation Space (DSRS) to bridge modality gaps. Extensive experiments demonstrate that PLANET significantly outperforms state-of-the-art baselines across diverse graph-centric and multimodal generative tasks.
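The abstract gives no equations for EDG or NDR, so the following is only a loose illustration of the two general ideas it names: a per-dimension gate that mixes a node's own embedding with neighbor-averaged context from another modality, and a shared discrete codebook onto which both modalities are mapped. It is not PLANET's implementation. The function names, the mean-over-neighbors context, the sigmoid gate, the nearest-neighbor lookup, and all shapes and random inputs are assumptions made for this sketch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_cross_modal_fusion(own, other, adj, w_gate):
    """Hypothetical gating step (assumption, not the paper's EDG).

    own, other: (n, d) embeddings of the same nodes in two modalities.
    adj:        (n, n) adjacency matrix of the graph.
    w_gate:     (2d, d) gate weights (random here, learned in practice).
    """
    deg = adj.sum(axis=1, keepdims=True)
    # Topology-aware context: mean of the neighbors' other-modality embeddings.
    context = (adj @ other) / np.maximum(deg, 1.0)
    # One gate value per node and per embedding dimension, in (0, 1).
    gate = sigmoid(np.concatenate([own, context], axis=1) @ w_gate)
    return gate * own + (1.0 - gate) * context


def nearest_code(x, codebook):
    """Hypothetical discretization step (assumption, not the paper's NDR).

    Returns, for each row of x, the index of the closest codebook vector.
    """
    sq_dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)


# Toy inputs: a 4-node cycle graph with two modalities per node.
rng = np.random.default_rng(0)
n, d, k = 4, 8, 5
adj = np.array(
    [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]],
    dtype=float,
)
text = rng.normal(size=(n, d))
image = rng.normal(size=(n, d))
w_gate = rng.normal(size=(2 * d, d)) * 0.1
codebook = rng.normal(size=(k, d))  # shared by both modalities

text_fused = gated_cross_modal_fusion(text, image, adj, w_gate)
image_fused = gated_cross_modal_fusion(image, text, adj, w_gate)

# Both modalities are expressed as indices into the same discrete space.
text_codes = nearest_code(text_fused, codebook)
image_codes = nearest_code(image_fused, codebook)
```

In this toy setup, mapping both modalities onto one codebook gives them a common discrete vocabulary, which is one simple way to read "bridging modality gaps". How PLANET actually builds and queries its DSRS is not stated in the abstract.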

Problem

Research questions and friction points this paper is trying to address.

Multimodal Graph Foundation Model

Modality Interaction

Modality Alignment

Multimodal-Attributed Graphs

Cross-modal Semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Graph Foundation Model

Modality Interaction

Modality Alignment