On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current knowledge graph embedding (KGE) evaluation suffers from severe biases: unrealistic benchmarks, small or flawed datasets, neglect of mediator-node modeling, untested cross-domain generalization, distortions from closed-world assumptions and the binary reduction of n-ary relations, and overreliance on MRR, which obscures failure modes. This work conducts the first systematic, cross-model evaluation, spanning TransE, RotatE, ComplEx, and TuckER, on two large-scale, realistic datasets (FB-CVT-REV and FB+CVT-REV) under multiple protocols (triple classification, entity-pair ranking, and property prediction). It introduces evaluation paradigms that model mediator nodes explicitly and assess properties directly. Key findings include: (1) models' relative rankings diverge significantly between small and large datasets; (2) binarizing n-ary relations inflates performance estimates by up to 37%; and (3) MRR masks over 60% of domain-specific failures. These results expose a substantial gap between conventional small-scale benchmarks and real-world deployment scenarios.

📝 Abstract
Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks. Commonly used datasets are either faulty or too small to reflect real-world data. Few studies examine the role of mediator nodes, which are essential for modeling n-ary relationships, or investigate how model performance varies across domains. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples, contradicting the fundamental goal of link prediction. These metrics often compress accuracy assessment into a single value, obscuring models' specific strengths and weaknesses. The prevailing evaluation protocol also operates under the unrealistic assumption that an entity's properties, for which values are to be predicted, are known in advance. While alternative protocols such as property prediction, entity-pair ranking, and triple classification address some of these limitations, they remain underutilized. This paper conducts a comprehensive evaluation of four representative KGE models on the large-scale datasets FB-CVT-REV and FB+CVT-REV. Our analysis reveals critical insights, including substantial performance variations between small and large datasets, both in relative rankings and absolute metrics, systematic overestimation of model capabilities when n-ary relations are binarized, and fundamental limitations in current evaluation protocols and metrics.
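The abstract's point about compressing accuracy into a single value can be made concrete with a minimal sketch of mean reciprocal rank (MRR); the relation names and ranks below are hypothetical, chosen only to show how an aggregate score can mask relation-specific failures:

```python
# Minimal sketch with hypothetical ranks: MRR averages reciprocal ranks
# over all test queries, so strong performance on easy relations can
# mask near-total failure on hard ones.

def mrr(ranks):
    """Mean reciprocal rank over the ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the true tail entity, grouped by relation.
ranks_by_relation = {
    "/people/person/nationality": [1, 1, 2, 1],       # easy relation
    "/film/film/edited_by":       [50, 120, 200, 80], # hard relation
}

all_ranks = [r for rs in ranks_by_relation.values() for r in rs]
print(f"aggregate MRR: {mrr(all_ranks):.3f}")        # looks respectable
for rel, rs in ranks_by_relation.items():
    print(f"{rel:30s} MRR: {mrr(rs):.3f}")           # reveals the split
```

Reporting the per-relation (or per-domain) breakdown alongside the aggregate is one way to surface the failure modes the paper argues MRR hides.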
Problem

Research questions and friction points this paper is trying to address.

Evaluating KGE models on unrealistic small benchmarks
Ignoring mediator nodes in n-ary relationship modeling
Penalizing correct predictions of missing triples via closed-world metrics
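The closed-world penalty can be illustrated with a toy sketch (the entity IDs and scores below are invented): any candidate absent from the graph is treated as false, so a model is punished when it ranks a true-but-missing triple above the test answer, which is exactly the behavior link prediction is meant to reward.

```python
# Toy sketch with hypothetical scores: ranking candidates for a tail
# query (head, relation, ?) under the closed-world assumption.

test_tail = "m.usa"              # gold answer recorded in the test set
scores = {
    "m.canada": 0.91,            # plausibly true, but missing from the KG
    "m.usa":    0.88,
    "m.france": 0.10,
}

ranked = sorted(scores, key=scores.get, reverse=True)
rank = ranked.index(test_tail) + 1
print("rank of gold answer:", rank)
# The model's top prediction ("m.canada") may simply be a missing true
# triple, yet it pushes the gold answer to rank 2 and lowers MRR/Hits@1.
```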
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale datasets FB-CVT-REV and FB+CVT-REV
Examines mediator nodes for n-ary relationships
Alternative evaluation protocols like property prediction
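The mediator-node point can be made concrete with a small sketch (the entity IDs and relation names below are invented for illustration): Freebase encodes an n-ary fact through a compound value type (CVT) node, and binarizing it into direct triples drops the link between the fact's arguments.

```python
# Illustrative sketch with hypothetical identifiers: an n-ary fact
# ("person won an award for a specific work") kept via a CVT mediator
# node, versus a binarized version that loses the connection.

cvt_triples = [
    ("m.anderson", "/award/award_winner/awards_won", "cvt_01"),
    ("cvt_01", "/award/award_honor/award", "m.oscar_best_director"),
    ("cvt_01", "/award/award_honor/honored_for", "m.grand_budapest_hotel"),
]

# Binarized version: pairwise triples that no longer say the award was
# won *for that work* -- the shared mediator is gone.
binary_triples = [
    ("m.anderson", "/award/awards_won", "m.oscar_best_director"),
    ("m.anderson", "/award/works_honored", "m.grand_budapest_hotel"),
]

# Mediator nodes appear as both head and tail within the same fact.
mediators = {h for h, _, _ in cvt_triples} & {t for _, _, t in cvt_triples}
print("mediator nodes:", mediators)
```

Evaluating on the CVT-preserving form, as the FB-CVT-REV datasets do, avoids the performance overestimation the paper attributes to binarization.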
Nasim Shirvani-Mahdavi
University of Texas at Arlington, Arlington TX, 76019, USA
F. Akrami
University of Texas at Arlington, Arlington TX, 76019, USA
Chengkai Li
Professor of Computer Science and Engineering, The University of Texas at Arlington
Big Data & Data Science · Computational Journalism · Data-Driven Fact-Checking · Natural Language Processing