🤖 AI Summary
This work identifies a fundamental conceptual mismatch between diffusion models and top-N recommendation: the generative paradigm of diffusion models is incompatible with ranking tasks under implicit feedback, leading to severely degraded generation capability and inflated performance estimates. Method: We systematically reproduce four representative diffusion-based recommendation models published at SIGIR 2023–2024, conducting rigorous benchmarking, ablation studies, hyperparameter re-optimization, and carbon footprint analysis. Contribution/Results: All diffusion models consistently underperform lightweight baselines (e.g., LightGCN). We uncover a systemic “methodological hallucination” in this area—revealing deep flaws including poor reproducibility and inappropriate task modeling. Our findings provide critical methodological reflection and cautionary guidance for the principled application of generative models in recommender systems.
📝 Abstract
Countless new machine learning models are published every year and are reported to significantly advance the state-of-the-art in *top-n* recommendation. However, earlier reproducibility studies indicate that progress in this area may be quite limited. Specifically, various widespread methodological issues, e.g., comparisons with untuned baseline models, have led to an *illusion of progress*. In this work, our goal is to examine whether these problems persist in today's research. To this end, we aim to reproduce the latest advancements reported from applying modern Denoising Diffusion Probabilistic Models to recommender systems, focusing on four models published at the top-ranked SIGIR conference in 2023 and 2024. Our findings are concerning, revealing persistent methodological problems. Alarmingly, through experiments, we find that the latest recommendation techniques based on diffusion models, despite their computational complexity and substantial carbon footprint, are consistently outperformed by simpler existing models. Furthermore, we identify key mismatches between the characteristics of diffusion models and those of the traditional *top-n* recommendation task, raising doubts about their suitability for recommendation. We also note that, in the papers we analyze, the generative capabilities of these models are reduced to a minimum. Overall, our results and the continued methodological issues call for greater scientific rigor and a disruptive change in the research and publication culture in this area.