Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

📅 2026-03-17
🤖 AI Summary
This work addresses a key limitation of large-scale multimodal knowledge graphs (MMKGs): their image collections typically exclude semantically rich but visually ambiguous images, such as logos and symbols. To mitigate this, the authors propose the Beyond Images framework, which automatically retrieves entity-related images, converts their visual content into textual descriptions, and uses large language models to synthesize structured summaries from multi-source textual information, thereby enriching the textual modality. The study presents the first systematic approach to transforming ambiguous visual content into structured semantic representations, complemented by an optional human-in-the-loop verification interface that improves data reliability, all without modifying existing model architectures. Experiments on three public MMKG datasets demonstrate consistent improvements: Hits@1 increases by up to 7% overall, and for entities with ambiguous visual elements, MRR improves by 201.35% and Hits@1 by 333.33%.

📝 Abstract
Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu-zhang/Beyond-Images.
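The three-stage pipeline in the abstract can be sketched as a simple composition of steps. The function names and interfaces below (`retrieve_images`, `caption_image`, `fuse_descriptions`) are illustrative stand-ins, not the paper's actual API; in practice the retrieval, captioning, and fusion stages would call an image search backend, a vision-language captioner, and an LLM, respectively.

```python
def retrieve_images(entity: str) -> list[str]:
    """Stage 1: large-scale retrieval of additional entity-related images.
    Stub: returns placeholder image identifiers for illustration."""
    return [f"{entity}_img_{i}" for i in range(3)]


def caption_image(image_id: str) -> str:
    """Stage 2: convert visual content into a textual description, so that
    ambiguous visuals (logos, symbols) contribute semantics rather than noise.
    Stub: a real system would invoke a vision-language captioning model."""
    return f"description of {image_id}"


def fuse_descriptions(entity: str, descriptions: list[str]) -> str:
    """Stage 3: fuse multi-source descriptions into one concise,
    entity-aligned summary. Stubbed as concatenation; the paper uses an LLM."""
    return f"{entity}: " + "; ".join(descriptions)


def enrich_entity(entity: str) -> str:
    """Run the full enrichment pipeline for one entity. The resulting summary
    replaces or augments the text modality of a standard MMKG model."""
    images = retrieve_images(entity)
    descriptions = [caption_image(img) for img in images]
    return fuse_descriptions(entity, descriptions)


print(enrich_entity("ACME_Corp"))
```

Because the enriched summaries are plain text, they plug into existing MMKG models without touching architectures or loss functions, which is the data-centric point of the framework.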
Problem

Research questions and friction points this paper is trying to address.

Multi-Modal Knowledge Graphs
image ambiguity
visual information
dataset enrichment
knowledge graph completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Knowledge Graph
Image-to-Text Conversion
Large Language Model
Data Enrichment
Visual Ambiguity