AI Summary
This work addresses end-to-end training in two-stage ranking systems, where the initial retriever operates over an extremely large candidate set and conventional policy gradient methods become ineffective because their gradient variance explodes. The authors propose Credit Assignment Policy Gradient (CA-PG), which builds credit assignment into the policy gradient framework. Instead of differentiating the joint probability of the whole candidate set, CA-PG computes the gradient of the marginal probability that the target item is selected, obtained by marginalizing over all candidate sets that contain it. This substantially reduces gradient variance while preserving the ability to learn the correct item ranking. Theoretical analysis and empirical evaluations on both synthetic and real-world datasets, using early-stage rankers based on the Plackett-Luce model, show that CA-PG improves convergence speed and training stability, with the largest gains when the candidate set is large.
Abstract
Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
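To make the distinction between the two probabilities concrete, the following is a minimal brute-force sketch, not the authors' estimator. It assumes the candidate list is drawn sequentially without replacement from a Plackett-Luce model with weights `exp(score)`; the function names, the toy `scores`, and the list size `k` are all hypothetical. `pl_ordered_prob` is the joint probability of one specific candidate list (the quantity V-PG differentiates), while `inclusion_prob` sums that joint probability over every list containing a given item (the marginal that CA-PG targets).

```python
import itertools
import math


def pl_ordered_prob(scores, ordering):
    """Joint Plackett-Luce probability of drawing `ordering` (a tuple of
    item indices) sequentially without replacement, with weights exp(score)."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in ordering:
        prob *= weights[item] / remaining
        remaining -= weights[item]  # the drawn item leaves the pool
    return prob


def inclusion_prob(scores, target, k):
    """Marginal probability that `target` appears in a size-k candidate list,
    computed by summing over every ordered list that contains it.
    Brute force: only feasible for tiny toy problems."""
    n = len(scores)
    return sum(
        pl_ordered_prob(scores, perm)
        for perm in itertools.permutations(range(n), k)
        if target in perm
    )


# Hypothetical toy setup: 4 items, candidate lists of size 2.
scores = [1.0, 0.5, 0.0, -0.5]
k = 2

# Sanity check: the joint probabilities over all ordered lists sum to 1.
total = sum(
    pl_ordered_prob(scores, perm)
    for perm in itertools.permutations(range(len(scores)), k)
)

marginals = [inclusion_prob(scores, i, k) for i in range(len(scores))]
for i, p in enumerate(marginals):
    print(f"item {i}: P(in candidate list) = {p:.4f}")
```

Each list contains exactly `k` items, so the marginals sum to `k`. The enumeration grows combinatorially with the number of items and with `k`, so this only illustrates the quantity being marginalized and says nothing about how the paper computes or estimates it at scale.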