🤖 AI Summary
This work addresses the data sparsity challenge in generative recommender systems under user and item cold-start scenarios by establishing the first standardized evaluation protocol for cold-start recommendation. The authors systematically reproduce and analyze state-of-the-art generative recommendation approaches based on pre-trained language models (PLMs), carefully controlling key variables such as model scale, identifier design, and training strategy. Through comprehensive ablation and comparative experiments, they show that the performance of existing methods is significantly constrained by a confluence of confounded design choices rather than by inherent architectural limitations. By providing a rigorous, reproducible benchmark, this study strengthens the reliability of empirical conclusions in cold-start recommendation research and offers clear directions for future improvements.
📝 Abstract
Cold-start recommendation remains a central challenge in dynamic, open-world platforms, requiring models to recommend for newly registered users (user cold-start) and to recommend newly introduced items to existing users (item cold-start) under sparse or missing interaction signals. Recent generative recommenders built on pre-trained language models (PLMs) are often expected to mitigate cold-start by using item semantic information (e.g., titles and descriptions) and test-time conditioning on limited user context. However, cold-start is rarely treated as a primary evaluation setting in existing studies, and reported gains are difficult to interpret because key design choices, such as model scale, identifier design, and training strategy, are frequently changed together. In this work, we present a systematic reproducibility study of generative recommendation under a unified suite of cold-start protocols.
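The two cold-start settings above amount to partitioning test interactions by whether their user or item appears in training. A minimal sketch of that split logic is below; the toy tuples and the function name `cold_start_splits` are illustrative assumptions, not the paper's actual protocol.

```python
def cold_start_splits(train, test):
    """Partition (user, item) test pairs by which side is unseen in training."""
    seen_users = {u for u, _ in train}
    seen_items = {i for _, i in train}
    # User cold-start: new user interacting with a known item.
    user_cold = [(u, i) for u, i in test if u not in seen_users and i in seen_items]
    # Item cold-start: known user interacting with a new item.
    item_cold = [(u, i) for u, i in test if u in seen_users and i not in seen_items]
    # Warm: both sides observed during training.
    warm = [(u, i) for u, i in test if u in seen_users and i in seen_items]
    return user_cold, item_cold, warm

# Toy interaction log (illustrative only)
train = [("u1", "i1"), ("u1", "i2"), ("u2", "i1")]
test = [("u3", "i1"),  # new user, known item  -> user cold-start
        ("u1", "i9"),  # known user, new item  -> item cold-start
        ("u2", "i2")]  # both known            -> warm
user_cold, item_cold, warm = cold_start_splits(train, test)
```

A PLM-based generative recommender would then be evaluated separately on each partition, with item text (titles, descriptions) supplying the only signal for the cold side.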