Are We Really Measuring Progress? Transferring Insights from Evaluating Recommender Systems to Temporal Link Prediction

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies three systematic issues undermining the validity of temporal link prediction (TLP) evaluation: inconsistent sampling-based metrics, reliance on hard negative sampling, and metrics that implicitly assume equal base probabilities across source nodes. Method: Drawing on long-standing evaluation concerns from the recommender systems community, the authors diagnose existing benchmark protocols through illustrative examples and cross-domain analogy, and derive mitigation strategies that jointly target baseline fairness and sampling consistency. Contribution/Results: The work characterizes the origins of evaluation distortion in TLP, provides concrete guidelines toward more robust, interpretable, and reproducible benchmarks, and outlines directions for more credible model comparisons.
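To make issue (1) concrete, here is a minimal, self-contained sketch (with entirely synthetic scores, not code or data from the paper) showing why sampling-based metrics are inconsistent: the same model scores yield different sampled MRR values depending on how many negatives are drawn per positive edge, so numbers reported under different sampling protocols are not directly comparable.

```python
# Minimal sketch (synthetic scores): sampled MRR depends on the
# number of negatives drawn per positive, so values computed under
# different sampling protocols cannot be compared directly.
import random

random.seed(0)

def sampled_mrr(pos_scores, neg_score_pool, num_negatives, trials=200):
    """Mean reciprocal rank of each positive against `num_negatives`
    negatives sampled uniformly from a fixed score pool."""
    total = 0.0
    for _ in range(trials):
        for pos in pos_scores:
            negs = random.sample(neg_score_pool, num_negatives)
            rank = 1 + sum(n >= pos for n in negs)  # rank among sampled candidates
            total += 1.0 / rank
    return total / (trials * len(pos_scores))

# Synthetic scores: positives tend to score higher, but not always.
pos_scores = [0.9, 0.7, 0.55, 0.4]
neg_score_pool = [random.uniform(0, 1) for _ in range(1000)]

for k in (1, 20, 100):
    print(f"negatives per positive = {k:3d} -> "
          f"sampled MRR = {sampled_mrr(pos_scores, neg_score_pool, k):.3f}")
# MRR shrinks as k grows: the metric reflects the sampling protocol
# as much as the model being evaluated.
```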

📝 Abstract
Recent work has questioned the reliability of graph learning benchmarks, citing concerns around task design, methodological rigor, and data suitability. In this extended abstract, we contribute to this discussion by focusing on evaluation strategies in Temporal Link Prediction (TLP). We observe that current evaluation protocols are often affected by one or more of the following issues: (1) inconsistent sampled metrics, (2) reliance on hard negative sampling often introduced as a means to improve robustness, and (3) metrics that implicitly assume equal base probabilities across source nodes by combining predictions. We support these claims through illustrative examples and connections to longstanding concerns in the recommender systems community. Our ongoing work aims to systematically characterize these problems and explore alternatives that can lead to more robust and interpretable evaluation. We conclude with a discussion of potential directions for improving the reliability of TLP benchmarks.
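As a concrete illustration of issue (3), the sketch below (again entirely synthetic numbers, not taken from the paper) contrasts a per-source ranking metric with a pooled one: combining predictions from all source nodes into a single ranking implicitly assumes equal base probabilities, so a source with a lower score scale is penalized even when its own candidates are ranked perfectly.

```python
# Minimal sketch (synthetic numbers): pooling predictions across
# source nodes with different score scales distorts ranking metrics.
def auc(pos, neg):
    """Probability that a random positive outscores a random negative."""
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

# Source A is a hub: its candidates all receive high scores.
# Source B is quiet: lower scores overall, but locally well ranked.
pos_by_src = {"A": [0.90, 0.85, 0.80], "B": [0.30, 0.25]}
neg_by_src = {"A": [0.70, 0.65, 0.60], "B": [0.10, 0.05]}

# Per-source (macro) view: both sources separate positives perfectly.
for s in pos_by_src:
    print(f"source {s}: AUC = {auc(pos_by_src[s], neg_by_src[s]):.2f}")  # 1.00 each

# Pooled (micro) view: B's positives fall below A's negatives,
# so the combined ranking penalizes the quiet source.
pooled_pos = pos_by_src["A"] + pos_by_src["B"]
pooled_neg = neg_by_src["A"] + neg_by_src["B"]
print(f"pooled:   AUC = {auc(pooled_pos, pooled_neg):.2f}")  # 0.76
```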
Problem

Research questions and friction points this paper is trying to address.

Evaluating the reliability of Temporal Link Prediction (TLP) benchmarks
Addressing inconsistent sampled metrics and problematic hard negative sampling
Improving the robustness and interpretability of evaluation protocols
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focuses on evaluation strategies in TLP, informed by recommender systems research
Diagnoses inconsistent sampled metrics and hard negative sampling
Explores more robust and interpretable evaluation alternatives