AI Summary
Existing recommendation debiasing research predominantly relies on randomly-exposed datasets for evaluation; however, the conventional Recall metric exhibits systematic bias under this setting, leading to unreliable assessment of model performance.
Method: This paper formally identifies and proves the theoretical origin of this bias, and proposes URE, the first theoretically sound, unbiased Recall estimation framework that requires no access to fully-exposed data. URE leverages counterfactual inference and exposure probability modeling to debias random exposure and accurately estimate the true Recall that would be obtained under fully-exposed conditions.
Contribution/Results: Extensive experiments across multiple real-world datasets demonstrate that URE significantly improves alignment between estimated Recall and the fully-exposed ground truth, achieving an average Spearman correlation gain of 0.82. Crucially, URE corrects widespread performance misjudgments of existing debiasing methods, enabling more reliable and fair evaluation.
Abstract
Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Due to the unavailability of fully-exposed datasets, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing the traditional evaluation scheme to represent recommendation performance. However, in this study, we reveal that the traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to inconsistency between the Recall performance obtained using randomly-exposed datasets and that obtained using fully-exposed datasets. Such inconsistency indicates the potential unreliability of experimental conclusions about previous debiasing techniques and calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge the gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance on fully-exposed datasets. We provide theoretical evidence to demonstrate the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.
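To make the core idea concrete, the following is a minimal sketch, not the paper's actual URE implementation: it contrasts the standard Recall@K with a generic inverse-propensity-style correction, where each observed relevant item is reweighted by its exposure probability. The `propensity` dictionary and both function names are hypothetical illustrations of exposure probability modeling, not APIs from the paper.

```python
def naive_recall_at_k(ranked_items, relevant, k):
    """Standard Recall@K: fraction of relevant items appearing in the top-K list."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ips_recall_at_k(ranked_items, relevant, propensity, k):
    """Propensity-weighted Recall@K (illustrative, not the paper's URE):
    each observed relevant item is reweighted by 1 / p(exposure), so that
    rarely-exposed items count more toward the estimate."""
    weighted_hits = sum(1.0 / propensity[item]
                        for item in ranked_items[:k] if item in relevant)
    total_weight = sum(1.0 / propensity[item] for item in relevant)
    return weighted_hits / total_weight if total_weight else 0.0

# Toy example (hypothetical data): item "b" was rarely exposed (p = 0.1),
# so missing it from the top-1 list hurts the corrected estimate far more
# than the naive metric suggests.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "b"}
propensity = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.5}
print(naive_recall_at_k(ranked, relevant, 1))              # 0.5
print(ips_recall_at_k(ranked, relevant, propensity, 1))    # 0.1
```

The gap between the two numbers in the toy example illustrates the abstract's point: evaluating naively on a non-fully-exposed test set can substantially misstate the Recall a model would achieve under full exposure.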