OptiMAG: Structure-Semantic Alignment via Unbalanced Optimal Transport

📅 2026-01-30

🤖 AI Summary

This work addresses the inconsistency between the explicit graph structure of multimodal attributed graphs and the implicit semantic structures induced by each modality's embeddings. Because message passing follows the explicit structure, this inconsistency injects modality-specific noise and degrades node representation learning. To tackle this issue, the authors propose OptiMAG, a framework that, for the first time, incorporates unbalanced optimal transport into multimodal graph learning. OptiMAG uses the Fused Gromov-Wasserstein distance to promote cross-modal structural alignment within local neighborhoods, and a KL divergence penalty to adaptively mitigate inter-modal inconsistencies. Because it is designed as a plug-and-play regularizer, OptiMAG requires no modification to the backbone model and can be attached to a range of existing multimodal graph models. Experiments on node classification, link prediction, graph2text, and graph2image tasks show that it consistently outperforms existing baselines, supporting both its effectiveness and its generality across tasks.
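
For orientation, one common way to write an unbalanced fused Gromov-Wasserstein (FGW) objective between the two modality views of a local neighborhood is shown below. The notation is ours and this is a generic form, not necessarily the paper's exact loss: M is a cross-modal feature cost matrix between node i in modality 1 and node j in modality 2, C^(1) and C^(2) are intra-modal distance matrices over the same neighborhood, mu and nu are reference node masses, alpha in [0, 1] trades feature cost against structural cost, and rho > 0 sets the strength of the KL penalties that replace hard marginal constraints.

```latex
\mathcal{L}_{\mathrm{UFGW}}(\pi) =
  (1-\alpha)\sum_{i,j} M_{ij}\,\pi_{ij}
  + \alpha \sum_{i,j,k,l} \bigl(C^{(1)}_{ik} - C^{(2)}_{jl}\bigr)^{2}\,\pi_{ij}\,\pi_{kl}
  + \rho\,\mathrm{KL}\bigl(\pi\mathbf{1} \,\|\, \mu\bigr)
  + \rho\,\mathrm{KL}\bigl(\pi^{\top}\mathbf{1} \,\|\, \nu\bigr),
\qquad \pi \in \mathbb{R}_{\ge 0}^{n \times n},

\mathrm{KL}(a \,\|\, b) = \sum_i \Bigl( a_i \log \frac{a_i}{b_i} - a_i + b_i \Bigr).
```

The generalized KL divergence is used because the plan's row and column sums need not have the same total mass as mu and nu. As rho grows, the penalties force the row and column sums to equal mu and nu, and the objective reduces to balanced FGW; a finite rho lets the plan leave part of a neighborhood's mass untransported when the two modalities disagree, which is one plausible reading of how the KL term adaptively mitigates inter-modal inconsistencies.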

📝 Abstract

Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. However, we identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. For instance, neighbors in the explicit graph structure may be close in one modality but distant in another. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features, introducing modality-specific noise and impeding effective node representation learning. To address this, we propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework. OptiMAG employs the Fused Gromov-Wasserstein distance to explicitly guide cross-modal structural consistency within local neighborhoods, effectively mitigating structural-semantic conflicts. Moreover, a KL divergence penalty enables adaptive handling of cross-modal inconsistencies. This framework can be seamlessly integrated into existing multimodal graph models, acting as an effective drop-in regularizer. Experiments demonstrate that OptiMAG consistently outperforms baselines across multiple tasks, ranging from graph-centric tasks (e.g., node classification, link prediction) to multimodal-centric generation tasks (e.g., graph2text, graph2image). The source code will be available upon acceptance.
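
A minimal NumPy sketch of an unbalanced fused Gromov-Wasserstein penalty of the kind the abstract describes follows. It is not the authors' implementation (their code is not yet released), and every function name and hyperparameter (`alpha`, `eps`, `rho`, `n_outer`) is hypothetical. It assumes that both modality embeddings of one node's neighborhood have already been projected to a shared dimension, and it uses squared Euclidean distances for the intra-modal structure matrices and the cross-modal feature cost. It linearizes the Gromov-Wasserstein term around the current transport plan and solves each linearized problem with generalized Sinkhorn scaling, in which a KL penalty of weight `rho` relaxes the marginals and `eps` adds entropic smoothing.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def gen_kl(p, q):
    """Generalized KL divergence for positive vectors whose total masses may differ."""
    return float(np.sum(p * np.log(p / q) - p + q))


def unbalanced_sinkhorn(cost, mu, nu, eps, rho, n_iter=200):
    """Entropic OT whose marginal constraints are replaced by KL penalties of weight rho."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(mu), np.ones_like(nu)
    power = rho / (rho + eps)  # power < 1 softens the marginals; rho -> inf gives Sinkhorn
    for _ in range(n_iter):
        u = (mu / (K @ v)) ** power
        v = (nu / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]


def gw_linearized(c1, c2, pi):
    """L_ij = sum_kl (c1_ik - c2_jl)^2 pi_kl, expanded to avoid a 4-D tensor."""
    p, q = pi.sum(axis=1), pi.sum(axis=0)
    return (c1 ** 2 @ p)[:, None] + (c2 ** 2 @ q)[None, :] - 2 * c1 @ pi @ c2.T


def ufgw_regularizer(x_text, x_img, alpha=0.5, eps=0.05, rho=1.0, n_outer=10):
    """Hypothetical unbalanced FGW penalty between the text and image views of one
    neighborhood; row i of both inputs is the same node, projected to a shared dim."""
    n = x_text.shape[0]
    mu = nu = np.full(n, 1.0 / n)  # uniform reference masses (an assumption)
    c1 = pairwise_sq_dists(x_text, x_text)  # intra-modal structure, text view
    c2 = pairwise_sq_dists(x_img, x_img)    # intra-modal structure, image view
    m = pairwise_sq_dists(x_text, x_img)    # cross-modal feature cost
    # Rescale each cost to [0, 1] so exp(-cost / eps) does not underflow.
    c1, c2, m = (c / max(c.max(), 1e-12) for c in (c1, c2, m))
    pi = np.outer(mu, nu)
    for _ in range(n_outer):
        # Fix the GW term at the current plan, then re-solve a linear unbalanced OT.
        cost = (1 - alpha) * m + alpha * gw_linearized(c1, c2, pi)
        pi = unbalanced_sinkhorn(cost, mu, nu, eps, rho)
    value = ((1 - alpha) * np.sum(m * pi)
             + alpha * np.sum(gw_linearized(c1, c2, pi) * pi)
             + rho * gen_kl(pi.sum(axis=1), mu)
             + rho * gen_kl(pi.sum(axis=0), nu))
    return value, pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = rng.normal(size=(6, 8))
    x_v = x_t + 0.1 * rng.normal(size=(6, 8))  # image view close to the text view
    value, plan = ufgw_regularizer(x_t, x_v)
    print(f"penalty={value:.4f}, transported mass={plan.sum():.4f}")
```

To use it as a drop-in regularizer, one would compute such a value per sampled neighborhood and add it to the backbone's task loss with a weight, e.g. `loss = task_loss + lam * reg` (where `lam` is hypothetical). Training end to end would require porting the same steps to an autodiff framework, since NumPy provides no gradients.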

Problem

Research questions and friction points this paper is trying to address.

Multimodal Attributed Graphs

Structural-Semantic Discrepancy

Cross-Modal Inconsistency

Node Representation Learning

Optimal Transport
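
The structural-semantic discrepancy listed above can be inspected directly on a given graph. The sketch below is a hedged illustration with hypothetical function names. It assumes an integer edge array plus per-node text and image embeddings, which need not share a dimension. For each explicitly linked node pair, it compares the cosine similarity in each modality. A large absolute gap marks an edge whose endpoints are close in one modality but distant in the other, which is the kind of neighbor the abstract says message passing would aggregate as noise.

```python
import numpy as np


def cosine_rows(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, 1e-12)


def edge_modality_gap(edges, x_text, x_img):
    """Per-edge neighbor similarity in each modality and their difference.

    edges: int array of shape (E, 2) with (source, target) node indices.
    A large |gap| marks an edge whose endpoints agree in one modality only.
    """
    src, dst = edges[:, 0], edges[:, 1]
    sim_text = cosine_rows(x_text[src], x_text[dst])
    sim_img = cosine_rows(x_img[src], x_img[dst])
    return sim_text, sim_img, sim_text - sim_img


if __name__ == "__main__":
    # Toy graph: edge (0, 1) agrees in text only, edge (1, 2) in image only.
    x_text = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    x_img = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    edges = np.array([[0, 1], [1, 2]])
    _, _, gap = edge_modality_gap(edges, x_text, x_img)
    print(gap)  # [ 1. -1.]
```

Applying a threshold to `np.abs(gap)` (the cutoff is a hypothetical choice) would list the conflicting edges. This only diagnoses the problem and is not part of OptiMAG itself.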

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unbalanced Optimal Transport

Fused Gromov-Wasserstein

Multimodal Attributed Graphs