Towards Two-Stage Counterfactual Learning to Rank

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing counterfactual learning to rank (CLTR) methods predominantly assume a single-stage, full-list ranking setting, rendering them ill-suited for real-world two-stage retrieval systems (candidate generation followed by fine-grained ranking), especially at million-scale document volumes. Current two-stage CLTR approaches are limited to the top-1 ranking setup and rely on a fixed, pre-trained ranker, so they cannot jointly optimize the generator and the ranker. This paper introduces, for the first time, a two-stage CLTR estimator that explicitly models cross-stage interaction effects. By integrating propensity scoring with value estimation, the authors design an end-to-end differentiable framework supporting joint gradient-based optimization of both stages. Evaluated on a semi-synthetic benchmark, the method significantly outperforms existing baselines in effectiveness, scalability, and bias correction.

📝 Abstract
Counterfactual learning to rank (CLTR) aims to learn a ranking policy from user interactions while correcting for the inherent biases in interaction data, such as position bias. Existing CLTR methods assume a single ranking policy that selects a top-K ranking from the entire candidate document set. In real-world applications, the candidate document set is on the order of millions, making a single-stage ranking policy impractical. To scale to millions of documents, real-world ranking systems are designed in a two-stage fashion, with a candidate generator followed by a ranker. The existing CLTR method for two-stage offline ranking systems considers only the top-1 ranking setup and trains only the candidate generator, with the ranker fixed; a CLTR method for jointly training both the ranker and the candidate generator is missing from the literature. In this paper, we propose a two-stage CLTR estimator that accounts for the interaction between the two stages and estimates the joint value of the two policies offline. In addition, we propose a novel joint optimization method to train the candidate generator and ranker policies. To the best of our knowledge, we are the first to propose a CLTR estimator and learning method for two-stage ranking. Experimental results on a semi-synthetic benchmark demonstrate the effectiveness of the proposed joint CLTR method over baselines.
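To make the position-bias correction at the heart of CLTR concrete, here is a minimal sketch of a standard inverse-propensity-scoring (IPS) estimator, the single-stage building block that methods like this paper's extend. This is not the paper's estimator; the propensity values and click data below are illustrative assumptions.

```python
import numpy as np

def ips_relevance_estimate(clicks, positions, propensities):
    """Debiased per-document relevance estimate: click / P(examined at position)."""
    return clicks / propensities[positions]

# Examination probability decays with rank (illustrative values).
propensities = np.array([1.0, 0.5, 0.25, 0.125])

clicks = np.array([1.0, 0.0, 1.0, 0.0])   # observed clicks per document
positions = np.array([0, 1, 2, 3])        # rank each document was shown at

estimates = ips_relevance_estimate(clicks, positions, propensities)
# The click at rank 2 is up-weighted by 1/0.25 = 4, compensating for
# the lower chance that users examined that position.
```

In expectation over clicks, this reweighting yields an unbiased estimate of relevance despite position bias, which is why propensity scoring appears as one component of the paper's joint value estimator.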
Problem

Research questions and friction points this paper addresses.

Biased user interaction data (e.g., position bias) in two-stage ranking systems
No existing method for jointly training the ranker and candidate generator
No estimator for the joint value of the two stages' policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage CLTR estimator for joint policy value
Joint optimization of candidate and ranker policies
First CLTR method for two-stage ranking systems
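To illustrate what "estimating the joint value of the two policies offline" could look like, here is a hypothetical sketch of a two-stage pipeline scored with IPS-weighted clicks. The function names, scores, and propensities are illustrative assumptions, not the paper's actual estimator, which additionally models cross-stage interaction effects.

```python
import numpy as np

def two_stage_value(gen_scores, rank_scores, clicks, propensities, m, k):
    """Offline value of a generator + ranker pair on one logged query."""
    # Stage 1: candidate generation keeps the top-m documents by generator score.
    candidates = np.argsort(-gen_scores)[:m]
    # Stage 2: the ranker orders the candidates and exposes the top-k.
    ranked = candidates[np.argsort(-rank_scores[candidates])][:k]
    # IPS-weighted value: logged clicks corrected for position bias.
    return np.sum(clicks[ranked] / propensities[:k])

value = two_stage_value(
    gen_scores=np.array([0.9, 0.1, 0.8, 0.7, 0.2]),
    rank_scores=np.array([0.2, 0.9, 0.8, 0.95, 0.1]),
    clicks=np.array([1.0, 0.0, 1.0, 1.0, 0.0]),
    propensities=np.array([1.0, 0.5, 0.25]),
    m=3, k=2,
)
```

Note how the generator's cut determines which documents the ranker can ever expose: this coupling is the cross-stage interaction that, per the summary, single-stage estimators and fixed-ranker approaches fail to account for.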