RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and slow inference of large language models (LLMs), this paper proposes a speculative sampling method with reinforcement learning–based dynamic draft trees. Unlike conventional speculative sampling with a fixed draft depth or width, the approach is the first to formulate draft tree construction as a Markov decision process (MDP) and employ offline reinforcement learning to decide, at each decoding step, the number of draft-model calls and the tree topology, enabling adaptive candidate generation and verification. Experiments across three mainstream LLMs and four representative tasks show a 3.17x–4.82x inference speedup over standard autoregressive decoding while significantly reducing redundant computation. The implementation is publicly available.
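The core idea above (a learned policy choosing, per decoding step, how many draft tokens to propose before target-model verification) can be sketched as a toy speculative sampling loop. This is a minimal illustration, not the paper's implementation: `draft_model`, `target_verify`, and `policy` are hypothetical stand-ins, and the policy here is a trivial heuristic where RADAR would use an offline-RL prediction model.

```python
import random

def draft_model(context, n):
    """Toy draft model: propose n candidate tokens (random here)."""
    return [random.randint(0, 9) for _ in range(n)]

def target_verify(context, candidates):
    """Toy verifier: accept the longest prefix of candidates that the
    target model would keep (even tokens stand in for 'accepted')."""
    accepted = []
    for tok in candidates:
        if tok % 2 == 0:
            accepted.append(tok)
        else:
            break
    return accepted

def policy(context):
    """Stand-in for the learned MDP policy: choose the draft budget
    from the current decoding state (here, just the context length)."""
    return 2 + len(context) % 3

def decode(context, steps):
    """Speculative decoding loop with a per-step dynamic draft budget."""
    for _ in range(steps):
        budget = policy(context)                  # dynamic decision
        candidates = draft_model(context, budget) # draft-model calls
        accepted = target_verify(context, candidates)
        if not accepted:
            # Fall back to a single target-model token (always accepted).
            accepted = [random.randint(0, 4) * 2]
        context = context + accepted
    return context

random.seed(0)
print(decode([1, 2, 3], steps=4))
```

A fixed-budget baseline would replace `policy` with a constant; the speedup claimed in the paper comes from adapting that budget (and the full tree shape) to the decoding state so fewer draft tokens are wasted on rejection.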

📝 Abstract
Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem. However, the number of calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, lacking flexibility. To generate and utilize candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decisions on the calls to the draft model, reducing redundant computation and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x–4.82x over the auto-regressive decoding baseline. The code is available at https://github.com/minaduki-sora/RADAR.
Problem

Research questions and friction points this paper is trying to address.

High computational cost and slow inference of large language models
Preset, inflexible number of draft-model calls in speculative sampling
Redundant computation from fixed draft-tree depth and width
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses RL-based dynamic draft trees for speculative sampling
Formulates draft generation as Markov Decision Process
Employs offline reinforcement learning to reduce computations
👥 Authors
Junjie Ma — Academy of Mathematics and Systems Science, Chinese Academy of Sciences (signal processing, message passing algorithms, optimization)
Jinlong Li — School of Computer Science and Technology, University of Science and Technology of China