Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders

📅 2025-07-22

🤖 AI Summary

Leave-one-out (LOO) evaluation, widely adopted in sequential recommendation, suffers from temporal leakage and excessively long test horizons, so it fails to reflect real-world deployment conditions. Global time-based splitting is more realistic, but it lacks standardized protocols for target interaction selection and validation set construction, which leads to incomparable results and unstable model rankings. Method: We systematically analyze how prevalent data splitting strategies impact offline evaluation, and propose a standardized framework for global time-based splitting, specifying precise definitions of target interactions and enforcing consistency criteria for validation and test sets. Contribution/Results: Extensive experiments across multiple benchmarks with mainstream baseline models demonstrate significant performance discrepancies between LOO and time-aware evaluation, validating the latter’s superior realism. We open-source our implementation, providing a reproducible, industry-aligned evaluation benchmark that advances standardization in sequential recommendation assessment.

📝 Abstract

Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits overlap between training and test periods, which leads to temporal leakage and an unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its application to sequential recommendations remains loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides the necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split
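
The difference between the two splits described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper's repository: the toy interaction log, the function names (`leave_one_out_split`, `global_temporal_split`, `has_temporal_overlap`), and the cutoff value are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy interaction log: (user, item, timestamp). Values are made up for illustration.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 9),
    ("u2", "d", 3), ("u2", "e", 4), ("u2", "f", 5),
    ("u3", "g", 7), ("u3", "h", 8),
]

def leave_one_out_split(log):
    """Hold out each user's last interaction as the test target."""
    by_user = defaultdict(list)
    for user, item, ts in sorted(log, key=lambda r: r[2]):
        by_user[user].append((user, item, ts))
    train, test = [], []
    for events in by_user.values():
        train.extend(events[:-1])
        test.append(events[-1])
    return train, test

def global_temporal_split(log, cutoff):
    """Train on interactions before a single global cutoff, test on the rest."""
    train = [r for r in log if r[2] < cutoff]
    test = [r for r in log if r[2] >= cutoff]
    return train, test

def has_temporal_overlap(train, test):
    """True if some training interaction is later than the earliest test one."""
    return max(r[2] for r in train) > min(r[2] for r in test)

loo_train, loo_test = leave_one_out_split(interactions)
gts_train, gts_test = global_temporal_split(interactions, cutoff=7)

print("LOO overlap:", has_temporal_overlap(loo_train, loo_test))
print("Global temporal overlap:", has_temporal_overlap(gts_train, gts_test))
```

In this toy log, the leave-one-out training set contains an interaction at timestamp 7 while one test target sits at timestamp 5, so training and test periods overlap. The global cutoff keeps every training interaction strictly earlier than every test interaction.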

Problem

Research questions and friction points this paper is trying to address.

Evaluating sequential recommender systems accurately

Addressing temporal leakage in data splits

Improving reproducibility with realistic splitting strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Global temporal splitting for sequential recommendations

Systematic comparison of splitting strategies

Aligning evaluation with real-world scenarios
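
Under a global temporal split, a user may have several interactions after the cutoff, so the choice of which one counts as the prediction target has to be made explicit. The sketch below shows one way such a choice could be parameterized. The strategy names (`"first"`, `"last"`, `"all"`) and the `select_targets` helper are hypothetical and are not claimed to match the paper's definitions.

```python
def select_targets(test_events, strategy="last"):
    """Pick target interactions per user from held-out events.

    test_events: list of (user, item, timestamp) after the global cutoff.
    strategy: "first", "last", or "all" (hypothetical option names).
    """
    by_user = {}
    for user, item, ts in sorted(test_events, key=lambda r: r[2]):
        by_user.setdefault(user, []).append((item, ts))

    targets = {}
    for user, events in by_user.items():
        if strategy == "first":
            targets[user] = events[:1]   # earliest held-out interaction
        elif strategy == "last":
            targets[user] = events[-1:]  # latest held-out interaction
        elif strategy == "all":
            targets[user] = list(events)  # every held-out interaction
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return targets

# Made-up held-out interactions for two users.
held_out = [("u1", "c", 9), ("u1", "x", 11), ("u3", "g", 7), ("u3", "h", 8)]
first = select_targets(held_out, "first")
last = select_targets(held_out, "last")
print(first)
print(last)
```

Different strategies yield different targets for the same user, which is one reason results from loosely specified temporal splits can be hard to compare.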

🔎 Similar Papers

A Comprehensive Survey on Retrieval Methods in Recommender Systems