🤖 AI Summary
Traditional ranking models rely on simplified surrogate losses (e.g., pointwise relevance) that fail to capture real-world user behavioral biases—such as position bias, brand affinity, decoy effects, and similarity aversion—leading to misalignment between the optimization objective and actual list-level utility (e.g., click or purchase probability). To address this, the authors propose RewardRank, a data-driven counterfactual reward learning framework for ranking. It trains a deep utility model to estimate holistic user engagement over entire permutations, then optimizes a ranking policy end-to-end via differentiable soft permutations. Two automated, complementary evaluation protocols—KD-Eval and LLM-Eval—are introduced to estimate the utility of unseen permutations. Experiments on benchmarks including Baidu-ULTR and the Amazon KDD Cup datasets show consistent improvements over strong baselines, indicating that explicitly modeling user behavior dynamics is important for optimizing real-world interactive utility.
📝 Abstract
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a ranked list in which items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground-truth labels for unseen permutations, we introduce two automated protocols: (i) $\textit{KD-Eval}$, using a position-aware oracle for counterfactual reward estimation, and (ii) $\textit{LLM-Eval}$, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
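To make the "differentiable soft permutation" idea concrete: one common relaxation (the abstract does not specify which operator RewardRank uses, so this is an illustrative choice, not the paper's exact implementation) is NeuralSort, which maps a vector of policy scores to a row-stochastic matrix that approaches a hard permutation matrix as the temperature shrinks. A minimal NumPy sketch:

```python
import numpy as np

def neural_sort(scores, tau=1.0):
    """Relaxed permutation matrix from policy scores (NeuralSort-style).

    Row i is a soft one-hot distribution over items for rank position i;
    as tau -> 0 it approaches the hard descending-sort permutation.
    """
    s = np.asarray(scores, dtype=float).reshape(-1, 1)   # (n, 1)
    n = s.shape[0]
    A = np.abs(s - s.T)                                   # pairwise |s_i - s_j|
    B = A @ np.ones((n, 1))                               # row sums of A, shape (n, 1)
    k = (n + 1 - 2 * np.arange(1, n + 1)).reshape(-1, 1)  # (n+1-2i) for i = 1..n
    logits = (k * s.T - B.T) / tau                        # (n, n)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # row-wise softmax

# Hypothetical usage in the spirit of the abstract: apply the soft permutation
# to item features and score the permuted list with a (frozen) utility model.
# Because every step is differentiable, gradients w.r.t. `scores` can flow
# through the utility estimate in an autodiff framework.
scores = np.array([0.1, 2.0, 1.0])
P = neural_sort(scores, tau=0.05)          # near-hard at low temperature
item_features = np.eye(3)                  # toy per-item features
permuted = P @ item_features               # features reordered by soft rank
```

At low `tau`, the row-wise argmax of `P` recovers `np.argsort(-scores)`, i.e., the hard descending sort, while larger `tau` yields a smoother (lower-variance) relaxation for training.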