Reproducibility and Artifact Consistency of the SIGIR 2022 Recommender Systems Papers Based on Message Passing

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study conducts the first large-scale reproducibility audit of graph neural network (GNN)-based recommender systems presented at SIGIR 2022, focusing on ten message-passing papers. Through empirical reimplementation, data pipeline auditing, fair baseline evaluation, and artifact consistency verification, it identifies four critical methodological flaws: erroneous data splitting, train/test information leakage, inconsistencies between paper descriptions and the open-sourced code and data, and biased baseline selection, particularly the omission of simple yet effective baselines. On Amazon-Book, most claimed state-of-the-art (SOTA) improvements prove illusory; several models even underperform standard baselines. The findings point to fundamental methodological weaknesses and a credibility problem in current GNN recommendation research. The work contributes a systematic, multi-dimensional reproducibility assessment framework and actionable recommendations to strengthen empirical rigor and reporting standards in recommender systems research.
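Two of the flaws the audit targets, train/test information leakage and erroneous splits, can be checked mechanically. The sketch below is illustrative only (it is not from the paper's artifacts): it assumes interactions are `(user, item)` pairs, builds a simple leave-last-out-per-user split, and verifies that no interaction appears in both partitions.

```python
from collections import defaultdict

# Minimal sketch of the kind of data-pipeline check the audit describes.
# The data model ((user, item) pairs, list order = time order) is an
# assumption for illustration, not the paper's actual pipeline.

def leaked_interactions(train, test):
    """Return the (user, item) pairs present in both splits.

    A non-empty result indicates train/test information leakage.
    """
    return set(train) & set(test)

def split_by_user_holdout(interactions, holdout=1):
    """Hold out each user's last `holdout` interactions for testing.

    Users with a single interaction stay entirely in the training set,
    so the test set never contains a user unseen during training.
    """
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, test = [], []
    for user, items in by_user.items():
        # Keep at least one interaction per user in the training split.
        cut = max(len(items) - holdout, 1) if len(items) > 1 else len(items)
        train += [(user, i) for i in items[:cut]]
        test += [(user, i) for i in items[cut:]]
    return train, test

if __name__ == "__main__":
    logs = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e")]
    train, test = split_by_user_holdout(logs)
    assert leaked_interactions(train, test) == set()  # a sound split has no overlap
```

A duplicated interaction in the raw data, or a split that samples test items with replacement from the full log, would make `leaked_interactions` non-empty, which is exactly the class of error the paper reports in several audited artifacts.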

📝 Abstract
Graph-based techniques relying on neural networks and embeddings have gained attention as a way to develop Recommender Systems (RS) with several papers on the topic presented at SIGIR 2022 and 2023. Given the importance of ensuring that published research is methodologically sound and reproducible, in this paper we analyze 10 graph-based RS papers, most of which were published at SIGIR 2022, and assess their impact on subsequent work published in SIGIR 2023. Our analysis reveals several critical points that require attention: (i) the prevalence of bad practices, such as erroneous data splits or information leakage between training and testing data, which call into question the validity of the results; (ii) frequent inconsistencies between the provided artifacts (source code and data) and their descriptions in the paper, causing uncertainty about what is actually being evaluated; and (iii) the preference for new or complex baselines that are weaker compared to simpler ones, creating the impression of continuous improvement even when, particularly for the Amazon-Book dataset, the state-of-the-art has significantly worsened. Due to these issues, we are unable to confirm the claims made in most of the papers we examined and attempted to reproduce.
Problem

Research questions and friction points this paper is trying to address.

Assessing reproducibility of graph-based recommender systems papers.
Identifying inconsistencies between artifacts and paper descriptions.
Evaluating impact of methodological flaws on research validity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic reproducibility audit of graph-based RS papers from SIGIR 2022
Artifact consistency verification between source code, data, and paper descriptions
Fair re-evaluation against simple yet effective baselines