On the Reliability of Sampling Strategies in Offline Recommender Evaluation

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline recommendation evaluation, exposure bias and sampling bias can severely compromise assessment reliability, distorting comparisons of model performance. To address this, we propose a four-dimensional evaluation framework (resolution, i.e., model discriminability; fidelity; robustness; and predictive power) that systematically quantifies how sampling strategies distort evaluation outcomes. Leveraging fully observed ground-truth data, we simulate diverse exposure-bias scenarios and empirically analyze mainstream sampling strategies, including uniform and popularity-based sampling. The experiments reveal significant trade-offs across the four dimensions, with no universally optimal strategy; accordingly, we provide task-aware guidelines for selecting a sampling strategy that matches a given evaluation objective. This work establishes the first systematic, reliability-oriented evaluation paradigm for sampling strategies in offline recommendation, offering both theoretical foundations and practical guidance for trustworthy offline evaluation.

📝 Abstract
Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: sampling resolution (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
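The two sampling strategies the abstract names can be made concrete with a small sketch. The helper names and the dict-based score interface below are mine, not the paper's; this is a minimal illustration of uniform vs. popularity-based negative sampling and of a sampled Hit@k metric, not the authors' implementation.

```python
import random

def sample_negatives(positive_item, catalog, popularity, n,
                     strategy="uniform", rng=None):
    """Draw n negative items for one held-out interaction.

    strategy="uniform":    every non-positive item is equally likely.
    strategy="popularity": items are drawn proportionally to their
                           interaction counts, mimicking exposure-biased logs.
    """
    rng = rng or random.Random(0)
    candidates = [i for i in catalog if i != positive_item]
    if strategy == "uniform":
        return rng.sample(candidates, n)
    weights = [popularity[i] for i in candidates]
    # weighted sampling without replacement via rejection of repeats
    chosen = []
    while len(chosen) < n:
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        if pick not in chosen:
            chosen.append(pick)
    return chosen

def sampled_hit_rate(scores, positive_item, negatives, k=10):
    """Hit@k on the sampled candidate set: does the positive item
    rank inside the top k among {positive} plus the negatives?"""
    ranked = sorted([positive_item] + negatives, key=lambda i: -scores[i])
    return int(positive_item in ranked[:k])
```

Because the metric is computed on a small candidate set rather than the full catalog, its absolute value is inflated relative to full evaluation, which is exactly the distortion the paper's fidelity dimension measures.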
Problem

Research questions and friction points this paper is trying to address.

Assessing reliability of offline recommender evaluation methods
Mitigating exposure and sampling biases in recommender systems
Evaluating sampling strategies under diverse exposure conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulate diverse exposure biases systematically
Assess sampling strategies along four dimensions
Provide guidance for robust offline comparisons