🤖 AI Summary
Recommendation and search systems often inherit and amplify pre-existing ranking biases due to position bias—e.g., users’ tendency to click top-ranked items—leading to distorted relevance modeling. Method: This paper proposes the first large language model (LLM)-based approach for position bias estimation, enabling fine-grained modeling of row- and column-wise positional effects under complex interface layouts directly from raw interaction logs—overcoming the expressiveness limitations of traditional heuristic methods. The LLM-generated propensity scores are integrated into an inverse propensity scoring (IPS) framework for bias correction and used to train a re-ranking model. Contribution/Results: Experiments show that, while preserving NDCG@10, the weighted NDCG@10 improves by approximately 2%, significantly mitigating layout-induced bias propagation. This work establishes a novel paradigm for trustworthy ranking.
📝 Abstract
Recommender and search systems commonly rely on Learning To Rank models trained on logged user interactions to order items by predicted relevance. However, such interaction data is often subject to position bias, as users are more likely to click on items that appear higher in the ranking, regardless of their actual relevance. As a result, newly trained models may inherit and reinforce the biases of prior ranking models rather than genuinely improving relevance. A standard approach to mitigate position bias is Inverse Propensity Scoring (IPS), where the model's loss is weighted by the inverse of a propensity function, an estimate of the probability that an item at a given position is examined. However, accurate propensity estimation is challenging, especially in interfaces with complex non-linear layouts. In this paper, we propose a novel method for estimating position bias using Large Language Models (LLMs) applied to logged user interaction data. This approach offers a cost-effective alternative to online experimentation. Our experiments show that propensities estimated with our LLM-as-a-judge approach are stable across score buckets and reveal the row-column effects of Viator's grid layout that simpler heuristics overlook. An IPS-weighted reranker trained with these propensities matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%. We will verify these offline gains in forthcoming live-traffic experiments.