🤖 AI Summary
Leave-one-out (LOO) evaluation, widely adopted in sequential recommendation, suffers from temporal leakage and excessively long test horizons, so it fails to reflect real-world deployment conditions. Global time-based splitting is more realistic, but it lacks standardized protocols for selecting target interactions and constructing validation sets, which leads to incomparable results and unstable model rankings.
Method: We systematically analyze how prevalent data splitting strategies impact offline evaluation, and propose a standardized framework for global time-based splitting, specifying precise definitions of target interactions and enforcing consistency criteria for validation and test sets.
Contribution/Results: Extensive experiments across multiple benchmarks with mainstream baseline models demonstrate significant performance discrepancies between LOO and time-aware evaluation, validating the latter’s superior realism. We open-source our implementation, providing a reproducible, industry-aligned evaluation benchmark that advances standardization in sequential recommendation assessment.
📝 Abstract
Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios.
Although the widely used leave-one-out split matches next-item prediction, it permits overlap between training and test periods, which leads to temporal leakage and an unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its application to sequential recommendation remains loosely defined, particularly in how target interactions are selected and how a validation subset is constructed so that validation and test metrics stay consistent.
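To make the contrast concrete, the following is a minimal sketch of the two splits on a toy interaction log. It is not taken from the linked repository: the data, the function names (`leave_one_out`, `global_temporal`, `pick_targets`, `has_temporal_overlap`), and the split timestamp are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy interaction log: (user, item, timestamp). Values are invented for illustration.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 9),
    ("u2", "d", 3), ("u2", "e", 4), ("u2", "f", 5),
    ("u3", "g", 7), ("u3", "h", 8),
]

def by_user(log):
    """Group interactions by user, each sequence sorted by timestamp."""
    seqs = defaultdict(list)
    for user, item, ts in sorted(log, key=lambda r: r[2]):
        seqs[user].append((item, ts))
    return seqs

def leave_one_out(log):
    """Hold out each user's last interaction as the test target."""
    train, test = [], []
    for user, seq in by_user(log).items():
        train += [(user, item, ts) for item, ts in seq[:-1]]
        item, ts = seq[-1]
        test.append((user, item, ts))
    return train, test

def global_temporal(log, t_split):
    """Train on interactions before t_split; test on those at or after it."""
    train = [r for r in log if r[2] < t_split]
    test = [r for r in log if r[2] >= t_split]
    return train, test

def pick_targets(test, mode="first"):
    """Hypothetical target selection inside the test period: 'first', 'last', or 'all'."""
    targets = []
    for user, seq in by_user(test).items():
        if mode == "first":
            chosen = seq[:1]
        elif mode == "last":
            chosen = seq[-1:]
        else:
            chosen = seq
        targets += [(user, item, ts) for item, ts in chosen]
    return targets

def has_temporal_overlap(train, test):
    """True if some training interaction is later than the earliest test target."""
    return max(r[2] for r in train) > min(r[2] for r in test)

loo_train, loo_test = leave_one_out(interactions)
gts_train, gts_test = global_temporal(interactions, t_split=7)

print(has_temporal_overlap(loo_train, loo_test))  # leave-one-out: periods overlap
print(has_temporal_overlap(gts_train, gts_test))  # global temporal: periods are disjoint
print(pick_targets(gts_test, "first"), pick_targets(gts_test, "last"))
```

In this toy log, leave-one-out trains on an interaction at timestamp 7 while testing on one at timestamp 5, which is the kind of overlap the abstract calls temporal leakage. The global split avoids that overlap, but the `mode` argument shows that the choice of target within the test period is a separate decision that changes what is evaluated.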
In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split