🤖 AI Summary
This work addresses the poor reproducibility and distorted model rankings often observed in offline evaluation of recommender systems, which stem from opaque data preprocessing and splitting strategies. To mitigate these issues, we propose an open-source exploratory toolkit that systematically diagnoses the validity of data splits—specifically targeting temporal leakage, cold-start exposure, and distributional shifts. Built in Python, the toolkit integrates time-aware statistics, duplicate interaction detection, distribution shift analysis, and interactive visualizations to enable comprehensive, no-code comparative validation of multiple splitting strategies. It further supports automated audit reporting, significantly enhancing the transparency, reliability, and cross-study comparability of offline evaluations in recommender systems research.
📝 Abstract
Offline evaluation of recommender systems is often affected by hidden, under-documented choices in data preparation. Seemingly minor decisions in filtering, repeat handling, cold-start treatment, and splitting strategy design can substantially reorder model rankings and undermine reproducibility and cross-paper comparability.
In this paper, we introduce SplitLight, an open-source exploratory toolkit that enables researchers and practitioners, whether designing preprocessing and splitting pipelines or auditing external artifacts, to make these decisions measurable, comparable, and reportable. Given an interaction log and derived split subsets, SplitLight analyzes core and temporal dataset statistics, characterizes repeat consumption patterns and timestamp anomalies, and diagnoses split validity, including temporal leakage, cold-user/item exposure, and distribution shifts. SplitLight further allows side-by-side comparison of alternative splitting strategies through comprehensive aggregated summaries and interactive visualizations. Delivered as both a Python toolkit and an interactive no-code interface, SplitLight produces audit summaries that justify evaluation protocols and support transparent, reliable, and comparable experimentation in recommender systems research and industry.
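To make the split-validity diagnostics concrete, the following is a minimal sketch of the kinds of checks described above: temporal leakage (test interactions that predate the latest training event) and cold-user/item exposure. This is an illustrative simplification, not SplitLight's actual API; the function name `audit_split` and the `(user, item, timestamp)` tuple format are assumptions for this example.

```python
def audit_split(train, test):
    """Diagnose a train/test split given lists of (user, item, timestamp) tuples.

    Returns a small audit dict: the fraction of test interactions that occur
    before the latest train timestamp (a coarse temporal-leakage signal), and
    counts of users/items that appear in test but never in train (cold-start).
    Hypothetical sketch, not the SplitLight interface.
    """
    max_train_ts = max(ts for _, _, ts in train)
    # Temporal leakage: test interactions that predate the latest train event.
    leaked = [row for row in test if row[2] < max_train_ts]
    train_users = {u for u, _, _ in train}
    train_items = {i for _, i, _ in train}
    cold_users = {u for u, _, _ in test} - train_users
    cold_items = {i for _, i, _ in test} - train_items
    return {
        "leakage_rate": len(leaked) / len(test),
        "cold_users": len(cold_users),
        "cold_items": len(cold_items),
    }

# Toy example: user u3 is unseen in train, and one test event (ts=2)
# predates the latest train event (ts=3).
train = [("u1", "a", 1), ("u1", "b", 2), ("u2", "a", 3)]
test = [("u1", "c", 4), ("u3", "a", 2)]
report = audit_split(train, test)
print(report)  # {'leakage_rate': 0.5, 'cold_users': 1, 'cold_items': 1}
```

A fuller audit along these lines would also compare train/test distributions (e.g. item-popularity or activity histograms) to quantify the distribution shifts the toolkit reports.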