🤖 AI Summary
Existing knowledge graph completion (KGC) evaluation overlooks two critical dimensions: prediction sharpness—the strictness of individual predictions—and robustness to popularity bias—the model’s generalization capability over low-popularity entities. This paper introduces PROBE, the first unified evaluation framework that jointly models both aspects. PROBE employs a Rank Transformer (RT) to dynamically calibrate prediction score strictness and a Popularity-aware Rank Aggregator (RA) to enable fine-grained, fairness-aware score aggregation. Experiments across multiple real-world datasets demonstrate that PROBE effectively mitigates performance overestimation or underestimation induced by popularity bias in conventional metrics (e.g., MRR, Hits@k). It significantly enhances evaluation reliability and model ranking stability, establishing a more scientific, interpretable, and multidimensional benchmark for KGC assessment.
📝 Abstract
Knowledge graph completion (KGC) aims to predict missing facts from the observed KG. While a number of KGC models have been studied, the evaluation of KGC itself remains underexplored. In this paper, we observe that existing metrics overlook two key perspectives for KGC evaluation: (A1) predictive sharpness -- the degree of strictness in evaluating an individual prediction, and (A2) popularity-bias robustness -- the ability to predict low-popularity entities. Toward reflecting both perspectives, we propose a novel evaluation framework (PROBE), which consists of a rank transformer (RT), which estimates the score of each prediction based on a required level of predictive sharpness, and a rank aggregator (RA), which aggregates all the scores in a popularity-aware manner. Experiments on real-world KGs reveal that existing metrics tend to over- or under-estimate the accuracy of KGC models, whereas PROBE yields a comprehensive understanding of KGC models and reliable evaluation results.
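To make the two components concrete, here is a minimal illustrative sketch of a PROBE-style evaluation. The function names, the power-law rank transform, and the log-popularity weighting are assumptions for illustration only, not the paper's actual definitions; they merely show how a sharpness-controlled per-prediction score (RT) could be combined with popularity-aware aggregation (RA), in contrast to plain MRR, which scores every test triple as `1/rank` and averages uniformly.

```python
import math

# Illustrative sketch of a PROBE-style metric. The specific formulas
# below are hypothetical, not taken from the paper.

def rank_transform(rank, sharpness):
    # RT (assumed form): score one prediction; a larger sharpness
    # penalizes non-top ranks more strictly. With sharpness=1 this
    # reduces to the reciprocal rank used by MRR.
    return 1.0 / (rank ** sharpness)

def popularity_aware_aggregate(ranks, popularity, sharpness=2.0):
    # RA (assumed form): weight each test triple inversely to its
    # target entity's popularity (here, a log-damped count) so that
    # low-popularity entities are not drowned out by popular ones.
    weights = [1.0 / math.log2(p + 2) for p in popularity]
    scores = [rank_transform(r, sharpness) for r in ranks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Example: ranks of the true entity for four test triples, with the
# (assumed) popularity counts of the corresponding target entities.
ranks = [1, 3, 2, 10]
popularity = [500, 4, 4, 1]
print(popularity_aware_aggregate(ranks, popularity))
```

Under this sketch, a model that only ranks popular entities well would score lower than plain MRR suggests, while strict sharpness settings expose models whose correct answers sit just below the top rank.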