Reinforcement Speculative Decoding for Fast Ranking

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive decoding makes large language models (LLMs) slow in information retrieval (IR) and recommender systems (RS): single-token decoding suffers sharp ranking degradation at tail positions, while speculative decoding incurs an uncontrolled number of verification rounds and discards list-level ranking knowledge. This paper proposes a **top-down, reinforcement-guided speculative decoding paradigm for low-latency ranking**. It introduces the first ranking-aware reinforcement speculative framework, featuring a **list-level knowledge cross-round reuse mechanism** alongside ranking-customized policy optimization and theoretical robustness guarantees. Leveraging PPO, multi-round verification-driven sequence correction, and list-level modeling, the method accelerates inference under strict latency budgets. Experiments demonstrate substantial throughput gains on IR/RS benchmarks while preserving, and often surpassing, baseline ranking quality, particularly improving accuracy at tail positions.

📝 Abstract
Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore single (first) token decoding for ranking approximation, but it suffers from severe degradation at tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. First, ranking systems require strict latency constraints, but the number of verification rounds in SD methods remains uncontrolled; second, SD methods usually discard listwise ranking knowledge about unaccepted items from previous rounds, hindering future multi-token prediction, especially when candidate tokens are the unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet ranking systems' latency requirements, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization that actively explores optimal multi-round ranking modification policies verified by LLMs via reinforcement learning (RL). To better approximate the target LLM under the constrained budget, we encourage the agent to fully utilize the listwise ranking knowledge about all items verified by LLMs across different rounds in RL, enhancing the agent's modification policy. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of our proposed method.
Problem

Research questions and friction points this paper is trying to address.

Reduces latency in LLM ranking systems with speculative decoding
Addresses ranking degradation in tail positions via multi-token prediction
Optimizes listwise ranking knowledge under strict latency constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Speculative Decoding for fast ranking
Up-to-down decoding with constrained budget
Ranking-tailored policy optimization via RL
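The up-to-down loop the abstract describes, where an agent iteratively modifies a draft ranking, the target LLM verifies it, and list-level knowledge about rejected items is reused across rounds under a fixed budget, can be sketched in miniature. Everything below is a hypothetical stand-in, not the paper's implementation: `verify` replaces the LLM verifier with a prefix match against a known target order, `agent_propose` replaces the PPO-trained policy with a simple reorder that exploits positions learned in earlier rounds, and the `knowledge` dict is a toy form of cross-round knowledge reuse.

```python
def verify(ranking, target):
    """Stand-in for LLM verification: length of the longest prefix of
    `ranking` that agrees with the target order. The paper instead
    verifies positions with the target LLM's listwise judgments."""
    k = 0
    while k < len(ranking) and ranking[k] == target[k]:
        k += 1
    return k

def agent_propose(ranking, accepted, knowledge):
    """Stand-in for the agent's modification policy: keep the accepted
    prefix, and reorder the unaccepted suffix using knowledge gathered
    in earlier verification rounds (known target positions come first;
    stable sort preserves the order of still-unknown items)."""
    suffix = sorted(ranking[accepted:],
                    key=lambda item: knowledge.get(item, len(ranking)))
    return ranking[:accepted] + suffix

def up_to_down_decode(items, target, budget):
    """Iteratively modify the ranking under a strict verification budget."""
    knowledge = {}           # list-level knowledge reused across rounds
    ranking = list(items)    # initial draft ranking
    for _ in range(budget):  # constrained number of verification rounds
        accepted = verify(ranking, target)
        if accepted == len(ranking):
            break            # fully verified: stop early
        # record the verified position of the first rejected slot
        knowledge[target[accepted]] = accepted
        ranking = agent_propose(ranking, accepted, knowledge)
    return ranking
```

In this toy setting each round pins down at least one more position, so a budget of `len(items)` rounds always suffices; the paper's point is that a learned policy plus reused listwise knowledge approximates the target ranking well within a much smaller budget.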
Yingpeng Du
Nanyang Technological University
Recommender system · Ensemble learning · LLMs
Tianjun Wei
Nanyang Technological University
User Modeling · Large Language Model · Recommender System
Zhu Sun
Information Systems Technology and Design, Singapore University of Technology and Design, Singapore
Jie Zhang
College of Computing and Data Science, Nanyang Technological University, Singapore