Reinforcement Speculative Decoding for Fast Ranking

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive decoding makes large language models (LLMs) slow in information retrieval (IR) and recommender systems (RS): single-token decoding suffers sharp ranking degradation at tail positions, while speculative decoding incurs an uncontrolled number of verification rounds and discards list-level ranking knowledge. This paper proposes a **top-down, reinforcement-guided speculative decoding paradigm for low-latency ranking**. It introduces the first ranking-aware reinforcement speculative framework, featuring a **list-level knowledge cross-round reuse mechanism** alongside ranking-customized policy optimization and theoretical robustness guarantees. Leveraging PPO, multi-round verification-driven sequence correction, and list-level modeling, the method accelerates inference under strict latency budgets. Experiments demonstrate substantial throughput gains on IR/RS benchmarks while preserving, and often surpassing, baseline ranking quality, particularly improving accuracy at tail positions.

📝 Abstract
Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore single (first) token decoding for ranking approximation, but it suffers from severe degradation at tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. First, ranking systems require strict latency constraints, but the number of verification rounds in SD methods remains uncontrolled; second, SD methods usually discard listwise ranking knowledge about unaccepted items from previous rounds, hindering future multi-token prediction, especially when candidate tokens are the unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet ranking systems' latency requirements, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization that actively explores optimal multi-round ranking modification policies verified by LLMs via reinforcement learning (RL). To better approximate the target LLM under the constrained budget, we encourage the agent to fully utilize the listwise ranking knowledge about all items verified by LLMs across different rounds in RL, enhancing the agent's modification policy. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of our proposed method.
Problem

Research questions and friction points this paper is trying to address.

Reduces latency in LLM ranking systems with speculative decoding
Addresses ranking degradation in tail positions via multi-token prediction
Optimizes listwise ranking knowledge under strict latency constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Speculative Decoding for fast ranking
Up-to-down decoding with constrained budget
Ranking-tailored policy optimization via RL
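The up-to-down loop the abstract describes, where an agent iteratively modifies a draft ranking, the target LLM verifies it, and list-level knowledge about rejected items is reused across rounds under a fixed budget, can be sketched in miniature. Everything below is a hypothetical stand-in, not the paper's implementation: `verify` replaces the LLM verifier with a prefix match against a known target order, `agent_propose` replaces the PPO-trained policy with a simple reorder that exploits positions learned in earlier rounds, and the `knowledge` dict is a toy form of cross-round knowledge reuse.

```python
def verify(ranking, target):
    """Stand-in for LLM verification: length of the longest prefix of
    `ranking` that agrees with the target order. The paper instead
    verifies positions with the target LLM's listwise judgments."""
    k = 0
    while k < len(ranking) and ranking[k] == target[k]:
        k += 1
    return k

def agent_propose(ranking, accepted, knowledge):
    """Stand-in for the agent's modification policy: keep the accepted
    prefix, and reorder the unaccepted suffix using knowledge gathered
    in earlier verification rounds (known target positions come first;
    stable sort preserves the order of still-unknown items)."""
    suffix = sorted(ranking[accepted:],
                    key=lambda item: knowledge.get(item, len(ranking)))
    return ranking[:accepted] + suffix

def up_to_down_decode(items, target, budget):
    """Iteratively modify the ranking under a strict verification budget."""
    knowledge = {}           # list-level knowledge reused across rounds
    ranking = list(items)    # initial draft ranking
    for _ in range(budget):  # constrained number of verification rounds
        accepted = verify(ranking, target)
        if accepted == len(ranking):
            break            # fully verified: stop early
        # record the verified position of the first rejected slot
        knowledge[target[accepted]] = accepted
        ranking = agent_propose(ranking, accepted, knowledge)
    return ranking
```

In this toy setting each round pins down at least one more position, so a budget of `len(items)` rounds always suffices; the paper's point is that a learned policy plus reused listwise knowledge approximates the target ranking well within a much smaller budget.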
Yingpeng Du
Nanyang Technological University
Recommender system · Ensemble learning · LLMs
Tianjun Wei
Nanyang Technological University
User Modeling · Large Language Model · Recommender System
Zhu Sun
Information Systems Technology and Design, Singapore University of Technology and Design, Singapore
Jie Zhang
College of Computing and Data Science, Nanyang Technological University, Singapore