🤖 AI Summary
Traditional ranking models rely on simplified surrogate losses (e.g., pointwise relevance) that fail to capture real-world user behavioral biases—such as position bias, brand affinity, decoy effects, and similarity aversion—leading to misalignment between the optimization objective and actual list-level utility (e.g., click or purchase probability). To address this, the authors propose RewardRank, a data-driven counterfactual reward learning framework for ranking. It trains a deep utility model to estimate holistic user engagement over entire permutations, then optimizes a ranking policy end-to-end via differentiable soft permutations. Two automated, complementary evaluation protocols—KD-Eval and LLM-Eval—are introduced to estimate the utility of unseen permutations. Experiments on benchmarks including Baidu-ULTR and the Amazon KDD Cup datasets show consistent improvements over strong baselines, indicating that explicitly modeling user behavior dynamics is important for optimizing real-world interactive utility.
📝 Abstract
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a ranked list in which items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground-truth labels for unseen permutations, we introduce two automated protocols: (i) $\textit{KD-Eval}$, using a position-aware oracle for counterfactual reward estimation, and (ii) $\textit{LLM-Eval}$, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
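To make the "differentiable soft permutation" idea concrete: one common relaxation (the abstract does not specify which operator RewardRank uses, so this is an illustrative choice, not the paper's exact implementation) is NeuralSort, which maps a vector of policy scores to a row-stochastic matrix that approaches a hard permutation matrix as the temperature shrinks. A minimal NumPy sketch:

```python
import numpy as np

def neural_sort(scores, tau=1.0):
    """Relaxed permutation matrix from policy scores (NeuralSort-style).

    Row i is a soft one-hot distribution over items for rank position i;
    as tau -> 0 it approaches the hard descending-sort permutation.
    """
    s = np.asarray(scores, dtype=float).reshape(-1, 1)   # (n, 1)
    n = s.shape[0]
    A = np.abs(s - s.T)                                   # pairwise |s_i - s_j|
    B = A @ np.ones((n, 1))                               # row sums of A, shape (n, 1)
    k = (n + 1 - 2 * np.arange(1, n + 1)).reshape(-1, 1)  # (n+1-2i) for i = 1..n
    logits = (k * s.T - B.T) / tau                        # (n, n)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # row-wise softmax

# Hypothetical usage in the spirit of the abstract: apply the soft permutation
# to item features and score the permuted list with a (frozen) utility model.
# Because every step is differentiable, gradients w.r.t. `scores` can flow
# through the utility estimate in an autodiff framework.
scores = np.array([0.1, 2.0, 1.0])
P = neural_sort(scores, tau=0.05)          # near-hard at low temperature
item_features = np.eye(3)                  # toy per-item features
permuted = P @ item_features               # features reordered by soft rank
```

At low `tau`, the row-wise argmax of `P` recovers `np.argsort(-scores)`, i.e., the hard descending sort, while larger `tau` yields a smoother (lower-variance) relaxation for training.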