A Worrying Reproducibility Study of Intent-Aware Recommendation Models

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical issues in intent-aware recommender systems (IARS): poor reproducibility, inflated performance claims, and a lack of methodological rigor. It presents the first large-scale, strictly controlled reproducibility assessment in this domain. Specifically, the authors systematically reimplement five state-of-the-art IARS models and evaluate them across multiple public benchmarks using the original code, the reported hyperparameters, and a unified evaluation protocol, comparing against classical non-neural baselines (e.g., MF, BPR) on several standard metrics. The findings reveal that two of the models fail to reproduce their originally reported results; moreover, every IARS model is consistently outperformed by at least one traditional method. These results suggest that the practical gains of current IARS approaches are limited, underscoring the urgent need for community-wide reproducibility benchmarks and rigorous evaluation practices. The study provides empirical grounding and methodological caution to foster healthy, evidence-based progress in intent-aware recommendation research.

📝 Abstract
Lately, we have observed a growing interest in intent-aware recommender systems (IARS). The promise of such systems is that they can generate better recommendations by predicting and considering the underlying motivations and short-term goals of consumers. From a technical perspective, various sophisticated neural models were recently proposed in this emerging and promising area. In the broader context of complex neural recommendation models, however, a growing number of research works indicate that (i) reproducing such works is often difficult and (ii) the true benefits of such models may be limited in reality, e.g., because the reported improvements were obtained through comparisons with untuned or weak baselines. In this work, we investigate whether recent research in IARS is similarly affected by such problems. Specifically, we tried to reproduce five contemporary IARS models that were published in top-level outlets, and we benchmarked them against a number of traditional, non-neural recommendation models. In two cases, running the provided code with the optimal hyperparameters reported in the paper did not yield the originally reported results. Worryingly, we find that all examined IARS approaches are consistently outperformed by at least one traditional model. These findings point to sustained methodological issues and to a pressing need for more rigorous scholarly practices.
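The abstract's "unified evaluation protocol" compares all models on standard ranking metrics such as Recall@K and NDCG@K. As a minimal sketch of how such metrics are typically computed for one user's top-K list (the function names and the toy data below are illustrative, not taken from the paper or its code):

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: log-discounted gain of hits, normalized
    by the gain of an ideal ranking that puts all relevant items first."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in rel)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one user's top-5 recommendations vs. held-out ground truth
recs = ["i3", "i7", "i1", "i9", "i4"]
truth = ["i1", "i4", "i8"]
print(round(recall_at_k(recs, truth, 5), 4))  # 2 of 3 relevant items hit
print(round(ndcg_at_k(recs, truth, 5), 4))    # hits ranked low, so NDCG < recall
```

In a benchmark like the one described above, these per-user scores would be averaged over all test users for each model, neural and non-neural alike, so that the comparison against baselines such as MF and BPR is on equal footing.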
Problem

Research questions and friction points this paper is trying to address.

Reproducibility
Model Reliability
Performance Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

IARS Evaluation
Reproducibility Issues
Traditional Methods Comparison