Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms

📅 2025-11-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper identifies an information leakage problem in link prediction: using the test set during hyperparameter tuning inflates model performance estimates. To quantify this bias, it proposes a new evaluation metric, Loss Ratio, and conducts a large-scale empirical study across 60 real-world networks using diverse parameterized models. Results show that performance is overestimated by about 3.6% on average, with some algorithms exhibiting biases exceeding 15%. Heuristic and random-walk-based methods are more robust to such leakage. The study establishes the need for standardized data splitting and evaluation protocols, providing both theoretical grounding and practical guidelines for trustworthy link prediction model assessment.

📝 Abstract

Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that information leakage leads to an average overestimation of about 3.6%, with the bias exceeding 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods exhibit greater robustness and stability. The analysis uncovers a pervasive information leakage issue in link prediction evaluation and underscores the necessity of adopting standardized data splitting strategies to enable fair and reproducible benchmarking of link prediction models.
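
The leakage described in the abstract comes down to which edge set the hyperparameters are tuned on. The sketch below is a minimal illustration, not the paper's protocol: the function names `split_edges` and `tune`, the split fractions, and the `score_fn` callback are all assumptions introduced here. It separates a validation set from the test set so that tuning never touches the edges used for the final report.

```python
import random


def split_edges(edges, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle edges once and cut them into train / validation / test lists.

    The fractions and the seed are illustrative defaults, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def tune(param_grid, score_fn, tuning_edges):
    """Return the parameter whose score on tuning_edges is highest.

    score_fn(param, edges) is a hypothetical callback that fits a link
    predictor with `param` and scores it on `edges`.
    """
    return max(param_grid, key=lambda p: score_fn(p, tuning_edges))
```

Under these assumptions, a leakage-free run calls `tune(grid, score_fn, val)` and reports the chosen parameter's score on `test` only once. The leaky practice corresponds to calling `tune(grid, score_fn, test)` and then reporting on the same `test` edges, so the reported score is the maximum over the grid rather than an unbiased estimate.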

Problem

Research questions and friction points this paper is trying to address.

Evaluating information leakage in link prediction model performance assessment

Quantifying performance overestimation bias from improper data splitting

Establishing standardized evaluation protocols for reproducible network analysis
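
This summary does not reproduce the paper's definition of Loss Ratio, so the sketch below does not attempt it. As a stand-in, it computes a plain relative gap between a score obtained with test-set tuning and one obtained with validation-set tuning, which is one simple way to read a figure such as "3.6% overestimation". The function name, the choice of denominator, and the example numbers are assumptions made for illustration.

```python
def relative_overestimation(leaky_score: float, clean_score: float) -> float:
    """Relative gap between a leaky and a leakage-free score.

    This is an illustrative stand-in, NOT the paper's Loss Ratio.
    Dividing by the clean score is an assumption made here.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (leaky_score - clean_score) / clean_score


# Made-up scores, chosen only to show the arithmetic.
gap = relative_overestimation(leaky_score=0.829, clean_score=0.800)
print(f"{gap:.2%}")
```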

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Loss Ratio metric to measure performance overestimation

Conducts large-scale experiments on 60 real-world networks

Advocates standardized data splitting to prevent information leakage
