AI Summary
This work addresses the lack of systematic evaluation of the generalization capabilities of generative recommender models, a gap that obscures whether their performance stems from genuine generalization or from memorization of training data. For the first time, we decouple memorization and generalization at the instance level and introduce a classification-based evaluation framework. It reveals that the apparent "generalization" of generative models frequently relies on token-level memorization, whereas traditional ID-based models perform better on instances that require memorization. Building on this insight, we propose a memorization-aware indicator that adaptively fuses the two model types on a per-instance basis, drawing on their complementary strengths. Extensive experiments show that this approach improves overall recommendation performance, supporting the effectiveness of such a hybrid, memorization-aware design.
Abstract
A widely held hypothesis for why generative recommendation (GR) models outperform conventional item ID-based models is that they generalize better. However, there are few systematic ways to verify this hypothesis beyond a superficial comparison of overall performance. To address this gap, we categorize each data instance based on the specific capability required for a correct prediction: either memorization (reusing item transition patterns observed during training) or generalization (composing known patterns to predict unseen item transitions). Extensive experiments show that GR models perform better on instances that require generalization, whereas item ID-based models perform better when memorization is more important. To explain this divergence, we shift the analysis from the item level to the token level and show that what appears to be item-level generalization often reduces to token-level memorization for GR models. Finally, we show that the two paradigms are complementary. We propose a simple memorization-aware indicator that adaptively combines them on a per-instance basis, leading to improved overall recommendation performance.
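
The instance-level categorization described above can be illustrated with a minimal sketch. It rests on a simplifying assumption that is not taken from the paper: an instance counts as "memorization" when the transition from the last item in the history to the target item appeared as a consecutive pair in some training sequence, and as "generalization" otherwise. The function names `observed_transitions` and `categorize` are hypothetical, and the paper's actual criteria may be more elaborate.

```python
def observed_transitions(train_sequences):
    """Collect every consecutive (previous item, next item) pair seen in training."""
    seen = set()
    for seq in train_sequences:
        # zip(seq, seq[1:]) yields each adjacent pair in the sequence
        seen.update(zip(seq, seq[1:]))
    return seen


def categorize(history, target, seen):
    """Label one test instance by the capability it appears to require.

    Assumption: only the last-item -> target transition is checked.
    """
    if history and (history[-1], target) in seen:
        return "memorization"
    return "generalization"


# Toy data (illustrative only, not from the paper)
train = [["a", "b", "c"], ["b", "c", "d"]]
seen = observed_transitions(train)

test_instances = [(["a", "b"], "c"), (["a", "c"], "b")]
labels = [categorize(h, t, seen) for h, t in test_instances]
```

Here the first test instance reuses the transition b to c, which was observed in training, while the second asks for c to b, which was never observed. Reporting accuracy separately for each label is one way to compare a GR model with an item ID-based model on the two kinds of instances.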