RewardRank: Optimizing True Learning-to-Rank Utility

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional ranking models rely on simplified surrogate losses (e.g., pointwise relevance), failing to capture real-world user behavioral biases—such as position bias, brand affinity, decoy effects, and similarity aversion—leading to misalignment between optimization objectives and actual list-level utility (e.g., click or purchase probability). To address this, we propose RewardRank: the first data-driven counterfactual reward learning framework for ranking. It employs a deep utility model to estimate holistic user engagement over entire permutations and enables end-to-end optimization via differentiable soft permutations. We further introduce KD-Eval and LLM-Eval—two automated, complementary evaluation protocols—to reliably estimate utilities of unseen permutations. Extensive experiments on benchmarks including Baidu-ULTR and Amazon KDD Cup demonstrate significant improvements over strong baselines, validating that explicitly modeling dynamic user behavior is critical for enhancing real-world interactive utility.

📝 Abstract
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a ranked list where items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground-truth utilities for unseen permutations, we introduce two automated protocols: (i) $\textit{KD-Eval}$, using a position-aware oracle for counterfactual reward estimation, and (ii) $\textit{LLM-Eval}$, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
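The key mechanism in the abstract is the differentiable soft permutation operator, which lets gradients flow from a list-level utility model back into the scoring policy. A minimal sketch of the idea, using a SoftSort-style relaxation over numpy; the function names, toy engagement values, and position weights below are hypothetical illustrations, not the paper's actual implementation:

```python
import numpy as np

def softsort(scores, tau=0.1):
    """SoftSort-style relaxed permutation matrix: row i softly selects
    the item that would be ranked i-th. As tau -> 0, each row approaches
    a one-hot vector, recovering a hard sort by descending score."""
    s_sorted = np.sort(scores)[::-1]                       # descending order
    logits = -np.abs(s_sorted[:, None] - scores[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)            # numerically stable softmax
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def utility(P, item_engagement, pos_weights):
    """Toy list-level utility: position-weighted engagement of the
    softly permuted items (stands in for a learned utility model)."""
    return float(pos_weights @ (P @ item_engagement))

scores = np.array([0.2, 1.5, -0.3, 0.9])       # policy scores for 4 items
P = softsort(scores, tau=0.05)                 # near-hard permutation
eng = np.array([0.1, 0.8, 0.05, 0.4])          # hypothetical per-item engagement
w = np.array([0.5, 0.25, 0.15, 0.10])          # hypothetical position bias weights
print(round(utility(P, eng, w), 3))            # → 0.52
```

Because every operation above is differentiable in `scores`, the same construction works inside an autodiff framework, where the ranking policy is trained by gradient ascent on the utility model's prediction rather than on a pointwise surrogate loss.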
Problem

Research questions and friction points this paper is trying to address.

Optimizing ranking systems for real user utility
Addressing behavioral biases in user interaction models
Learning counterfactual rewards for improved ranking performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual reward learning for user behavior modeling
Differentiable soft permutation operators for end-to-end training
Automated evaluation protocols using oracle and LLM simulation
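The KD-Eval protocol mentioned above scores counterfactual rankings with a position-aware oracle. A minimal self-contained sketch of what such an oracle could look like, assuming an examination-probability model of position bias; the function name and all toy numbers are hypothetical:

```python
import numpy as np

def kd_eval_utility(perm, relevance, position_bias):
    """Hypothetical position-aware oracle: estimated engagement of a
    permutation as the sum over slots of (examination probability at
    that slot) * (relevance of the item placed there)."""
    return float(np.sum(position_bias * relevance[perm]))

rel = np.array([0.9, 0.2, 0.6])      # toy per-item relevance
bias = np.array([1.0, 0.6, 0.3])     # toy examination probabilities by slot
best = kd_eval_utility(np.array([0, 2, 1]), rel, bias)   # relevance-sorted order
worse = kd_eval_utility(np.array([1, 2, 0]), rel, bias)  # a counterfactual order
print(best > worse)                  # the oracle ranks the sorted order higher
```

An oracle of this shape can score any permutation, including ones never shown to users, which is exactly what makes automated counterfactual evaluation possible.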