VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge graph embedding (KGE) methods struggle to effectively model multimodal entities and often process different modalities in isolation, resulting in weak cross-modal alignment and overly simplified semantic assumptions. This work proposes the first end-to-end joint representation learning framework that integrates vision-language models (VLMs) into multimodal KGE, leveraging VLMs’ inherent cross-modal alignment capabilities together with relational structure modeling from knowledge graphs to overcome modality isolation. Experimental results on WN9-IMG and two newly constructed art-domain multimodal knowledge graphs demonstrate that the proposed approach significantly outperforms current unimodal and multimodal KGE methods on link prediction tasks.
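As a rough illustration of the cross-modal alignment the summary refers to, the sketch below encodes an entity's image and name with a CLIP-style VLM into one shared embedding space. The mean-pooled fusion and the specific checkpoint are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch: multimodal entity features from a CLIP-style VLM
# (Hugging Face transformers). The image/text fusion by averaging is
# an illustrative assumption, not the paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_features(image: Image.Image, name: str) -> torch.Tensor:
    """Return one multimodal feature vector for a KG entity."""
    inputs = processor(text=[name], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Both features live in CLIP's shared space, so averaging the
    # normalized vectors is a meaningful (if simple) fusion.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return ((img + txt) / 2).squeeze(0)
```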

📝 Abstract
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained: they often process modalities in isolation, which weakens cross-modal alignment, and they rely on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine-art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods on link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.
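To make the framework's two ingredients concrete, here is a minimal, hypothetical sketch of how VLM entity features could be combined with relational structure modeling for link prediction. The linear projection, TransE-style translational scoring, margin of 1.0, and all dimensions are illustrative assumptions; the paper's actual architecture may differ.

```python
# Hypothetical sketch: fuse precomputed VLM entity features with a
# TransE-style relational score. Not the authors' implementation.
import torch
import torch.nn as nn

class VLKGESketch(nn.Module):
    def __init__(self, num_relations: int, vlm_dim: int = 512, kge_dim: int = 256):
        super().__init__()
        # Project VLM (image+text) entity features into the KGE space.
        self.proj = nn.Linear(vlm_dim, kge_dim)
        # Learned relation translation vectors (TransE-style).
        self.rel = nn.Embedding(num_relations, kge_dim)

    def score(self, head_feat, rel_idx, tail_feat):
        # head_feat / tail_feat: precomputed VLM embeddings per entity.
        h = self.proj(head_feat)
        t = self.proj(tail_feat)
        r = self.rel(rel_idx)
        # Higher score = more plausible triple (negative L2 distance).
        return -torch.norm(h + r - t, p=2, dim=-1)

# Usage: margin-ranking loss over positive vs. corrupted triples.
model = VLKGESketch(num_relations=9)
h = torch.randn(4, 512)      # stand-in for VLM head-entity features
t = torch.randn(4, 512)      # stand-in for VLM tail-entity features
t_neg = torch.randn(4, 512)  # corrupted tails (negative samples)
r = torch.randint(0, 9, (4,))
loss = torch.relu(1.0 + model.score(h, r, t_neg) - model.score(h, r, t)).mean()
```

Link prediction then ranks all candidate tails for a given (head, relation) query by this score, from which metrics such as MRR and Hits@K are computed.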
Problem

Research questions and friction points this paper is trying to address.

multimodal knowledge graphs
knowledge graph embeddings
cross-modal alignment
vision-language models
heterogeneous entities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Knowledge Graph Embeddings
Multimodal Representation
Cross-modal Alignment
Link Prediction
Authors
Athanasios Efthymiou, University of Amsterdam
Stevan Rudinac, Associate Professor, University of Amsterdam (multimedia, computer vision, information retrieval, machine learning)
Monika Kackovic, University of Amsterdam
Nachoem Wijnberg, Unknown affiliation
Marcel Worring, University of Amsterdam