UniGraph2: Learning a Unified Embedding Space to Bind Multimodal Graphs

πŸ“… 2025-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing multimodal models (e.g., CLIP) neglect graph-structured information, limiting their ability to capture the complex cross-modal relationships and topological structures inherent in multimodal graphs (MMGs). To address this, we propose the first cross-domain graph foundation model tailored for MMGs, integrating modality-specific encoders, graph neural networks (GNNs), and a Mixture of Experts (MoE) architecture to jointly learn a unified low-dimensional embedding space that fuses multimodal semantics with graph topology. We introduce a novel cross-domain multi-graph self-supervised pretraining paradigm, the first to enable general-purpose representation learning, zero-shot cross-domain transfer, and generative modeling on MMGs. Our model achieves state-of-the-art performance across MMG representation learning, cross-domain transfer, and generation tasks, while demonstrating strong scalability and cross-modal robustness.

πŸ“ Abstract
Existing foundation models, such as CLIP, aim to learn a unified embedding space for multimodal data, enabling a wide range of downstream web-based applications like search, recommendation, and content classification. However, these models often overlook the inherent graph structures in multimodal datasets, where entities and their relationships are crucial. Multimodal graphs (MMGs) represent such graphs where each node is associated with features from different modalities, while the edges capture the relationships between these entities. On the other hand, existing graph foundation models primarily focus on text-attributed graphs (TAGs) and are not designed to handle the complexities of MMGs. To address these limitations, we propose UniGraph2, a novel cross-domain graph foundation model that enables general representation learning on MMGs, providing a unified embedding space. UniGraph2 employs modality-specific encoders alongside a graph neural network (GNN) to learn a unified low-dimensional embedding space that captures both the multimodal information and the underlying graph structure. We propose a new cross-domain multi-graph pre-training algorithm at scale to ensure effective transfer learning across diverse graph domains and modalities. Additionally, we adopt a Mixture of Experts (MoE) component to align features from different domains and modalities, ensuring coherent and robust embeddings that unify the information across modalities. Extensive experiments on a variety of multimodal graph tasks demonstrate that UniGraph2 significantly outperforms state-of-the-art models in tasks such as representation learning, transfer learning, and multimodal generative tasks, offering a scalable and flexible solution for learning on MMGs.
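The abstract's pipeline (modality-specific encoders, a GNN over the graph structure, then an MoE component that aligns embeddings across domains and modalities) can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual architecture: the linear encoders, single mean-aggregation GNN layer, and softmax-gated linear experts are all simplified stand-ins, and every name and dimension here is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_modality(features, proj):
    # Hypothetical modality-specific encoder: a plain linear
    # projection into the shared low-dimensional space.
    return features @ proj

def gnn_layer(node_embs, adj):
    # One round of mean-aggregation message passing: each node
    # averages its own embedding with those of its neighbors.
    deg = adj.sum(axis=1, keepdims=True) + 1.0
    return (node_embs + adj @ node_embs) / deg

def moe_align(node_embs, experts, gate):
    # Simplified Mixture of Experts: a softmax gate weights the
    # outputs of linear experts to align features across
    # domains and modalities.
    scores = node_embs @ gate                          # (n, k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    outs = np.stack([node_embs @ e for e in experts], axis=1)  # (n, k, d)
    return (weights[..., None] * outs).sum(axis=1)

# Toy multimodal graph: 5 nodes, image + text features per node.
n, d_img, d_txt, d = 5, 16, 8, 4
img = rng.normal(size=(n, d_img))
txt = rng.normal(size=(n, d_txt))
adj = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(adj, 0.0)
adj = np.maximum(adj, adj.T)  # undirected

# 1) Modality-specific encoders map each modality into dimension d.
z = encode_modality(img, rng.normal(size=(d_img, d))) \
  + encode_modality(txt, rng.normal(size=(d_txt, d)))
# 2) A GNN layer fuses graph topology into the embeddings.
z = gnn_layer(z, adj)
# 3) An MoE component (k = 2 experts) aligns the fused embeddings.
experts = [rng.normal(size=(d, d)) for _ in range(2)]
z = moe_align(z, experts, rng.normal(size=(d, 2)))
print(z.shape)  # one unified d-dimensional embedding per node
```

The key design point the abstract emphasizes is that all three stages write into one shared embedding space, so downstream tasks (representation learning, transfer, generation) can consume the same node vectors regardless of source modality or graph domain.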
Problem

Research questions and friction points this paper is trying to address.

Multi-modal Graphs
CLIP Model
Structural Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Graphs
Graph Neural Networks
Mixture of Experts
πŸ”Ž Similar Papers
No similar papers found.