🤖 AI Summary
In offline recommendation evaluation, exposure bias and sampling bias can severely compromise reliability, distorting comparisons of model performance. To address this, we propose a four-dimensional evaluation framework (resolution, fidelity, robustness, and predictive power) that systematically quantifies how sampling strategies distort evaluation outcomes. Leveraging fully observed ground-truth data, we simulate diverse exposure-bias scenarios and empirically analyze mainstream sampling strategies, including uniform and popularity-based sampling. The experiments reveal significant trade-offs across the four dimensions, with no universally optimal strategy; accordingly, we provide task-aware guidelines for selecting a sampling strategy to match specific evaluation objectives. This work establishes the first systematic, reliability-oriented evaluation paradigm for sampling strategies in offline recommendation, offering both theoretical grounding and practical guidance for trustworthy offline evaluation.
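The two sampling schemes the summary names differ only in how negatives are drawn from the catalog. Below is a minimal Python sketch of both, assuming a flat array-based catalog; the function names, the Zipf-shaped popularity counts, and the toy data are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the paper's code) of uniform vs. popularity-based
# candidate sampling for offline evaluation. All identifiers and data here
# are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def uniform_negatives(catalog, positives, n_samples, rng):
    """Sample negative items uniformly from the catalog, excluding positives."""
    candidates = np.setdiff1d(catalog, positives)
    return rng.choice(candidates, size=n_samples, replace=False)

def popularity_negatives(catalog, positives, item_counts, n_samples, rng):
    """Sample negatives with probability proportional to interaction counts."""
    candidates = np.setdiff1d(catalog, positives)
    weights = item_counts[candidates].astype(float)
    weights /= weights.sum()
    return rng.choice(candidates, size=n_samples, replace=False, p=weights)

# Toy example: a 1,000-item catalog with a long-tailed popularity distribution.
catalog = np.arange(1000)
item_counts = rng.zipf(1.5, size=1000)   # skewed interaction counts
positives = np.array([3, 41, 977])       # a user's held-out test items
print(uniform_negatives(catalog, positives, 5, rng))
print(popularity_negatives(catalog, positives, item_counts, 5, rng))
```

Under popularity-based sampling, head items dominate the candidate set, which is exactly the kind of distortion the four-dimensional framework is designed to measure.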
📝 Abstract
Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users can only interact with the items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these methods are typically assessed on fixed logged datasets, not on their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: resolution (the ability to separate recommender models), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes, and they offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
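To make the fidelity dimension concrete: agreement with full evaluation can be operationalized as the rank correlation between the model ordering under sampled evaluation and the ordering under full-catalog evaluation. The sketch below uses Kendall's tau for this; the model names and metric values are fabricated placeholders, and the paper may well use a different correlation measure or metric.

```python
# A minimal sketch of fidelity as rank agreement between sampled and full
# evaluation. The per-model scores (e.g., NDCG@10) are fabricated; only the
# rank-correlation step reflects the definition given in the abstract.
from scipy.stats import kendalltau

full_eval    = {"MF": 0.31, "ItemKNN": 0.27, "BPR": 0.35, "Pop": 0.12}
sampled_eval = {"MF": 0.52, "ItemKNN": 0.49, "BPR": 0.55, "Pop": 0.30}

models = sorted(full_eval)
tau, p_value = kendalltau([full_eval[m] for m in models],
                          [sampled_eval[m] for m in models])
print(f"Kendall tau between sampled and full rankings: {tau:.2f} (p={p_value:.2f})")
```

A tau near 1 means sampled evaluation preserves the full-evaluation model ranking; values well below 1 signal the kind of distortion the paper's framework is meant to expose.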