🤖 AI Summary
Traditional offline evaluation does not characterize how recommendation models evolve over time, in particular the trade-off between stability (retaining historical patterns) and plasticity (adapting to new data) under continual retraining. This paper introduces an evaluation framework aimed at long-term model behavior: a staged retraining protocol that is agnostic to datasets, algorithms, and metrics. By retraining models in stages and tracking performance across time points, the framework quantifies stability and plasticity as separate, interpretable dimensions. Preliminary experiments on the GoodReads dataset with three representative types of recommendation algorithms reveal distinct behavioral profiles depending on the algorithmic technique and suggest a possible trade-off between the two properties. The framework offers a principled way to study the dynamic behavior of recommendation models and to assess their robustness under temporal distribution shift.
📝 Abstract
The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions, use part of this dataset to train a model, and use the remaining data to measure how closely the model's recommendations match the observed user interactions. This protocol is straightforward, useful, and practical, but it only captures the performance of a particular model trained at some point in the past. We know, however, that online systems evolve over time. In general, it is a good idea that models reflect such changes, so models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns -- stability -- and, on the other hand, (quickly) adapt to changes -- plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models and that is agnostic to datasets, algorithms, and metrics. To illustrate the potential of this framework, we present preliminary results for three different types of algorithms on the GoodReads dataset, which suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity. Although additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework for gaining insights into the long-term dynamics of recommendation models.
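The staged retraining protocol can be sketched in a few lines of code. The following is a minimal illustration, not the paper's actual implementation: the `train_fn`/`score_fn` interface, the popularity baseline, and the hit-rate metric are hypothetical stand-ins for whatever algorithms and metrics one plugs in. The key idea is the performance matrix: the model retrained at each stage is evaluated on every time window, so scores on past windows probe stability while scores on recent windows probe plasticity.

```python
from collections import Counter

def sliding_window_eval(windows, train_fn, score_fn):
    """Staged retraining sketch (assumed interface, not the paper's code).

    windows  : list of per-period interaction lists, ordered in time
    train_fn : rebuilds a model from all data seen so far
    score_fn : evaluates a model on one window with any offline metric

    Returns a matrix perf where perf[t][w] is the performance of the
    model retrained at stage t+1 on window w. Rows over past windows
    reflect stability; rows over later windows reflect plasticity.
    """
    perf = []
    for t in range(1, len(windows)):
        model = train_fn(windows[:t])  # retrain on all data up to stage t
        perf.append([score_fn(model, w) for w in windows])
    return perf

# Toy illustration with a popularity model (hypothetical baseline):
def train_popularity(history):
    return Counter(item for window in history for _, item in window)

def hit_rate(model, window):
    # Fraction of interactions whose item is in the top-2 recommendations.
    if not window:
        return 0.0
    top = {item for item, _ in model.most_common(2)}
    return sum(1 for _, item in window if item in top) / len(window)

windows = [
    [("u1", "a"), ("u2", "a"), ("u3", "b")],  # period 1: "a" is popular
    [("u1", "b"), ("u2", "b"), ("u3", "c")],  # period 2: tastes shift to "b"
    [("u1", "c"), ("u2", "c"), ("u3", "c")],  # period 3: "c" dominates
]
perf = sliding_window_eval(windows, train_popularity, hit_rate)
```

Each row of `perf` profiles one retrained model across all periods, so plotting rows over time exposes how quickly an algorithm forgets old patterns or absorbs new ones.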