🤖 AI Summary
This study conducts the first large-scale reproducibility audit of graph neural network (GNN)-based recommender systems presented at SIGIR 2022, focusing on ten message-passing papers. It identifies four critical methodological flaws: erroneous data splitting, train/test information leakage, inconsistencies between paper descriptions and the released code and data, and biased baseline selection, particularly the omission of simple yet effective baselines. Through empirical reimplementation, data pipeline auditing, fair baseline evaluation, and artifact consistency verification, the study finds these issues to be pervasive. On Amazon-Book, most claimed state-of-the-art (SOTA) improvements prove illusory; several models even underperform standard baselines. The findings expose fundamental methodological weaknesses and a credibility crisis in current GNN recommendation research. The work contributes a systematic, multi-dimensional reproducibility assessment framework and actionable recommendations to strengthen empirical rigor and reporting standards in recommender systems research.
📝 Abstract
Graph-based techniques relying on neural networks and embeddings have gained attention as a way to develop Recommender Systems (RS), with several papers on the topic presented at SIGIR 2022 and 2023. Given the importance of ensuring that published research is methodologically sound and reproducible, in this paper we analyze 10 graph-based RS papers, most of which were published at SIGIR 2022, and assess their impact on subsequent work published at SIGIR 2023. Our analysis reveals several critical points that require attention: (i) the prevalence of bad practices, such as erroneous data splits or information leakage between training and testing data, which call into question the validity of the results; (ii) frequent inconsistencies between the provided artifacts (source code and data) and their descriptions in the paper, causing uncertainty about what is actually being evaluated; and (iii) the preference for new or complex baselines that are weaker than simpler, well-established ones, creating the impression of continuous improvement even when, particularly on the Amazon-Book dataset, the state of the art has significantly worsened. Due to these issues, we are unable to confirm the claims made in most of the papers we examined and attempted to reproduce.
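The data-splitting and leakage issues described above can be made concrete with a minimal sketch. The snippet below is illustrative only and is not taken from any of the audited papers: it uses a tiny hypothetical interaction log, performs a per-user random holdout split, and then checks that no (user, item) pair appears in both the training and test sets, which is the most basic form of train/test leakage.

```python
import random

# Hypothetical toy interaction log: (user_id, item_id) pairs.
interactions = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
    ("u2", "i1"), ("u2", "i4"), ("u2", "i5"),
    ("u3", "i2"), ("u3", "i3"), ("u3", "i4"),
]

def per_user_holdout_split(interactions, test_ratio=0.34, seed=0):
    """Hold out a fraction of each user's interactions as test data,
    keeping the two splits disjoint by construction."""
    rng = random.Random(seed)
    by_user = {}
    for user, item in interactions:
        by_user.setdefault(user, []).append(item)
    train, test = [], []
    for user, items in by_user.items():
        items = items[:]          # copy before shuffling
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_ratio))
        test.extend((user, item) for item in items[:n_test])
        train.extend((user, item) for item in items[n_test:])
    return train, test

def leaks(train, test):
    """Return the (user, item) pairs that appear in both splits."""
    return set(train) & set(test)

train, test = per_user_holdout_split(interactions)
assert not leaks(train, test), "train/test leakage detected"
```

A sanity check like `leaks` is cheap to run on any released artifact; several of the problems the paper reports (overlapping splits, test interactions visible during training) would be caught by exactly this kind of audit before results are published.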