VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge graph embedding (KGE) methods struggle to effectively model multimodal entities and often process different modalities in isolation, resulting in weak cross-modal alignment and overly simplified semantic assumptions. This work proposes the first end-to-end joint representation learning framework that integrates vision-language models (VLMs) into multimodal KGE, leveraging VLMs’ inherent cross-modal alignment capabilities together with relational structure modeling from knowledge graphs to overcome modality isolation. Experimental results on WN9-IMG and two newly constructed art-domain multimodal knowledge graphs demonstrate that the proposed approach significantly outperforms current unimodal and multimodal KGE methods on link prediction tasks.
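As a rough illustration of the cross-modal alignment the summary refers to, the sketch below encodes an entity's image and name with a CLIP-style VLM into one shared embedding space. The mean-pooled fusion and the specific checkpoint are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch: multimodal entity features from a CLIP-style VLM
# (Hugging Face transformers). The image/text fusion by averaging is
# an illustrative assumption, not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_features(image: Image.Image, name: str) -> torch.Tensor:
    """Return one multimodal feature vector for a KG entity."""
    inputs = processor(text=[name], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Both features live in CLIP's shared space, so averaging the
    # normalized vectors is a meaningful (if simple) fusion.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return ((img + txt) / 2).squeeze(0)
```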

📝 Abstract
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained: they often process modalities in isolation, which weakens cross-modal alignment, and they rely on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine-art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods on link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.
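To make the framework's two ingredients concrete, here is a minimal, hypothetical sketch of how VLM entity features could be combined with relational structure modeling for link prediction. The linear projection, TransE-style translational scoring, margin of 1.0, and all dimensions are illustrative assumptions; the paper's actual architecture may differ.

```python
# Hypothetical sketch: fuse precomputed VLM entity features with a
# TransE-style relational score. Not the authors' implementation.
import torch
import torch.nn as nn

class VLKGESketch(nn.Module):
    def __init__(self, num_relations: int, vlm_dim: int = 512, kge_dim: int = 256):
        super().__init__()
        # Project VLM (image+text) entity features into the KGE space.
        self.proj = nn.Linear(vlm_dim, kge_dim)
        # Learned relation translation vectors (TransE-style).
        self.rel = nn.Embedding(num_relations, kge_dim)

    def score(self, head_feat, rel_idx, tail_feat):
        # head_feat / tail_feat: precomputed VLM embeddings per entity.
        h = self.proj(head_feat)
        t = self.proj(tail_feat)
        r = self.rel(rel_idx)
        # Higher score = more plausible triple (negative L2 distance).
        return -torch.norm(h + r - t, p=2, dim=-1)

# Usage: margin-ranking loss over positive vs. corrupted triples.
model = VLKGESketch(num_relations=9)
h = torch.randn(4, 512)      # stand-in for VLM head-entity features
t = torch.randn(4, 512)      # stand-in for VLM tail-entity features
t_neg = torch.randn(4, 512)  # corrupted tails (negative samples)
r = torch.randint(0, 9, (4,))
loss = torch.relu(1.0 + model.score(h, r, t_neg) - model.score(h, r, t)).mean()
```

Link prediction then ranks all candidate tails for a given (head, relation) query by this score, from which metrics such as MRR and Hits@K are computed.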
Problem

Research questions and friction points this paper is trying to address.

multimodal knowledge graphs
knowledge graph embeddings
cross-modal alignment
vision-language models
heterogeneous entities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Knowledge Graph Embeddings
Multimodal Representation
Cross-modal Alignment
Link Prediction
Authors
Athanasios Efthymiou, University of Amsterdam
Stevan Rudinac, Associate Professor, University of Amsterdam (multimedia, computer vision, information retrieval, machine learning)
Monika Kackovic, University of Amsterdam
Nachoem Wijnberg, Unknown affiliation
Marcel Worring, University of Amsterdam