LLMs for estimating positional bias in logged interaction data

📅 2025-09-03
🤖 AI Summary
Recommender and search systems often inherit and amplify the biases of earlier ranking models because of position bias (users tend to click top-ranked items regardless of their actual relevance), which distorts relevance modeling. Method: This paper proposes the first large language model (LLM)-based approach to position-bias estimation, modeling row- and column-wise positional effects under complex interface layouts directly from raw interaction logs, beyond what traditional heuristic methods can express. The LLM-generated propensity scores are integrated into an inverse propensity scoring (IPS) framework for bias correction and used to train a re-ranking model. Contribution/Results: Experiments show that the re-ranker matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%, mitigating layout-induced bias propagation; live-traffic experiments to verify these offline gains are forthcoming.

📝 Abstract
Recommender and search systems commonly rely on Learning To Rank models trained on logged user interactions to order items by predicted relevance. However, such interaction data is often subject to position bias, as users are more likely to click on items that appear higher in the ranking, regardless of their actual relevance. As a result, newly trained models may inherit and reinforce the biases of prior ranking models rather than genuinely improving relevance. A standard approach to mitigate position bias is Inverse Propensity Scoring (IPS), where the model's loss is weighted by the inverse of a propensity function, an estimate of the probability that an item at a given position is examined. However, accurate propensity estimation is challenging, especially in interfaces with complex non-linear layouts. In this paper, we propose a novel method for estimating position bias using Large Language Models (LLMs) applied to logged user interaction data. This approach offers a cost-effective alternative to online experimentation. Our experiments show that propensities estimated with our LLM-as-a-judge approach are stable across score buckets and reveal the row-column effects of Viator's grid layout that simpler heuristics overlook. An IPS-weighted reranker trained with these propensities matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%. We will verify these offline gains in forthcoming live-traffic experiments.
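The IPS correction described in the abstract can be sketched as follows. This is a minimal illustration of inverse-propensity weighting of a pointwise click loss; the click labels, scores, and propensity values are illustrative placeholders, not the paper's data or model:

```python
import numpy as np

def ips_weighted_loss(clicks, scores, propensities):
    """Pointwise binary cross-entropy, weighted by inverse propensity.

    clicks: 1.0 if the logged user clicked the item, else 0.0
    scores: the ranker's predicted relevance logits
    propensities: estimated probability that each position was examined
    """
    probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid of the logits
    bce = -(clicks * np.log(probs) + (1 - clicks) * np.log(1 - probs))
    # Dividing by the propensity up-weights interactions at positions
    # users rarely examine, counteracting position bias in the logs.
    return np.mean(bce / propensities)

# Illustrative example: the click at the low-propensity third position
# contributes far more to the loss than the click at the top position.
clicks = np.array([1.0, 0.0, 1.0])
scores = np.array([2.0, -1.0, 0.5])
propensities = np.array([0.9, 0.6, 0.2])
loss = ips_weighted_loss(clicks, scores, propensities)
```

In this sketch the propensities would come from the paper's LLM-based estimator; any propensity estimate (heuristic or learned) plugs into the same weighting.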
Problem

Research questions and friction points this paper is trying to address.

Estimating positional bias in logged user interaction data for recommender systems
Addressing challenges in accurate propensity estimation for complex interface layouts
Preventing newly trained models from inheriting prior ranking biases rather than genuinely improving relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to estimate positional bias in logs
Applies LLM-as-a-judge for cost-effective propensity scoring
Enables IPS-weighted reranker improving NDCG performance
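One hypothetical shape the LLM-as-a-judge propensity estimation could take for a grid layout is sketched below. The prompt wording, the `query_llm` callable, and the stubbed response are all assumptions for illustration; the paper's actual prompting and model are not reproduced here:

```python
import json

def estimate_grid_propensities(rows, cols, query_llm):
    """Ask an LLM to judge examination probability for each (row, col)
    cell of a grid layout, normalized to the top-left cell.

    `query_llm` is a hypothetical callable mapping a prompt string to
    the model's text response; any LLM client could stand in here.
    """
    prompt = (
        f"A search results page shows items in a {rows}x{cols} grid. "
        "For each cell, estimate the probability (0 to 1) that a typical "
        "user examines it. Reply with a JSON list of rows, top to bottom."
    )
    grid = json.loads(query_llm(prompt))
    top = grid[0][0]
    # Normalize so the most-examined position has propensity 1.0,
    # giving relative examination odds for IPS weighting.
    return [[cell / top for cell in row] for row in grid]

# Stubbed LLM response for illustration: examination probability
# decays both down rows and across columns, a row-column effect a
# purely rank-based heuristic would miss.
def fake_llm(prompt):
    return json.dumps([[0.9, 0.8, 0.6], [0.5, 0.4, 0.3]])

props = estimate_grid_propensities(2, 3, fake_llm)
```

The normalized grid can then feed directly into an IPS-weighted training loss.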
Aleksandr V. Petrov
Research Scientist, Spotify
Recommender Systems · Information Retrieval · Natural Language Processing · Deep Learning
Michael Murtagh
Viator, Tripadvisor, Lisbon, Portugal
Karthik Nagesh
Viator, Tripadvisor, London, UK