Debias Can be Unreliable: Mitigating Bias Issue in Evaluating Debiasing Recommendation

πŸ“… 2024-09-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing recommendation debiasing research predominantly relies on randomly-exposed datasets for evaluation; however, the conventional Recall metric exhibits systematic bias under this setting, leading to unreliable assessments of model performance. Method: This paper formally identifies and proves the theoretical origin of this bias, and proposes URE, the first theoretically sound, unbiased Recall estimation framework that requires no access to fully-exposed data. URE leverages counterfactual inference and exposure-probability modeling to adjust how randomly-exposed data is used, accurately estimating the true Recall that would be observed under full exposure. Contribution/Results: Extensive experiments across multiple real-world datasets demonstrate that URE significantly improves the alignment between estimated Recall and the fully-exposed ground truth, achieving an average Spearman correlation gain of 0.82. Crucially, URE corrects widespread performance misjudgments of existing debiasing methods, enabling more reliable and fair evaluation.
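
To make the Spearman check mentioned above concrete, the snippet below correlates each evaluation scheme's estimated Recall with the fully-exposed ground truth across a handful of debiasing methods. All the Recall numbers are invented purely for illustration; only the use of `scipy.stats.spearmanr` is standard.

```python
# Hypothetical illustration of the Spearman-alignment check: do the
# estimated Recall values rank debiasing methods in the same order as
# the fully-exposed ground truth? (All numbers below are made up.)
from scipy.stats import spearmanr

full_exposure_recall = [0.31, 0.27, 0.42, 0.35, 0.29]  # ground truth
naive_estimate       = [0.55, 0.61, 0.58, 0.52, 0.66]  # re-ranks the methods
unbiased_estimate    = [0.30, 0.26, 0.40, 0.36, 0.28]  # preserves the order

for name, est in [("naive", naive_estimate), ("unbiased", unbiased_estimate)]:
    rho, _ = spearmanr(full_exposure_recall, est)
    print(f"{name:9s} Spearman rho vs. ground truth: {rho:+.2f}")
```

A biased estimator can invert the method ranking entirely (here, rho = -0.60 for the naive numbers), which is exactly the kind of performance misjudgment the summary says URE corrects.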

πŸ“ Abstract
Recent work has improved recommendation models remarkably by equipping them with debiasing methods. Due to the unavailability of fully-exposed datasets, most existing approaches resort to randomly-exposed datasets as a proxy for evaluating debiased models, employing the traditional evaluation scheme to represent recommendation performance. However, in this study, we reveal that the traditional evaluation scheme is not suitable for randomly-exposed datasets, leading to inconsistency between the Recall performance obtained using randomly-exposed datasets and that obtained using fully-exposed datasets. Such inconsistency indicates the potential unreliability of experimental conclusions about previous debiasing techniques and calls for unbiased Recall evaluation using randomly-exposed datasets. To bridge the gap, we propose the Unbiased Recall Evaluation (URE) scheme, which adjusts the utilization of randomly-exposed datasets to unbiasedly estimate the true Recall performance on fully-exposed datasets. We provide theoretical evidence demonstrating the rationality of URE and perform extensive experiments on real-world datasets to validate its soundness.
Problem

Research questions and friction points this paper is trying to address.

The traditional evaluation scheme is unreliable on randomly-exposed datasets
Recall measured on randomly-exposed data is inconsistent with Recall on fully-exposed data
Debiasing recommendation models need an unbiased evaluation scheme (a toy simulation of the bias follows below)
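
The simulation below makes the inconsistency tangible. It is a hypothetical illustration, not the paper's URE estimator: it contrasts Recall@K computed the traditional way, ranking only within a randomly-exposed sample, against Recall@K on the full catalog, and shows how a simple rank-extrapolation correction (one plausible flavor of adjustment) closes most of the gap.

```python
# Toy simulation of the evaluation bias (NOT the paper's URE estimator).
# Traditional Recall@K ranks only the items in a randomly-exposed test
# sample, which inflates the metric relative to the fully-exposed catalog.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_exposed, K, n_users = 2000, 200, 100, 500

full, naive, corrected = [], [], []
for _ in range(n_users):
    scores = rng.normal(size=n_items)                     # model scores
    relevant = rng.choice(n_items, size=20, replace=False)
    scores[relevant] += 1.0                               # model beats chance

    # Fully-exposed ground truth: rank each relevant item in the whole catalog.
    full_ranks = np.array([(scores > scores[i]).sum() for i in relevant])
    full.append((full_ranks < K).mean())

    # Randomly-exposed test set: a uniform sample of the catalog.
    exposed = rng.choice(n_items, size=n_exposed, replace=False)
    rel_exposed = np.intersect1d(relevant, exposed)
    if rel_exposed.size == 0:
        continue

    # Traditional scheme: rank only within the exposed sample (inflated).
    sub_ranks = np.array([(scores[exposed] > scores[i]).sum() for i in rel_exposed])
    naive.append((sub_ranks < K).mean())

    # Hypothetical correction: extrapolate the within-sample rank to an
    # estimated full-catalog rank, then apply the same top-K cutoff.
    est_full_ranks = sub_ranks * (n_items / n_exposed)
    corrected.append((est_full_ranks < K).mean())

print(f"fully-exposed Recall@{K}:   {np.mean(full):.3f}")
print(f"naive sampled Recall@{K}:   {np.mean(naive):.3f}")
print(f"rank-extrapolated estimate: {np.mean(corrected):.3f}")
```

The heuristic extrapolation here is only meant to show the direction of the fix; URE derives its adjustment formally and backs it with a proof of unbiasedness rather than a simulation.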
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Unbiased Recall Evaluation (URE) scheme
Adjusts how randomly-exposed datasets are used to yield unbiased Recall estimates
Validates URE with both theoretical and experimental evidence
Chengbing Wang
University of Science and Technology of China
Wentao Shi
University of Science and Technology of China
Jizhi Zhang
University of Science and Technology of China
Wenjie Wang
National University of Singapore
Hang Pan
University of Science and Technology of China
Fuli Feng
University of Science and Technology of China