PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the misalignment between the training objective of existing draft models and the inference-time goal of maximizing consecutive token acceptance, which limits the speedup achievable with speculative decoding. To resolve this, the authors propose the PARD-2 framework, which reformulates the draft model's optimization objective to prioritize overall accepted sequence length over individual token accuracy. PARD-2 introduces a Confidence-Adaptive Token (CAT) strategy that adaptively reweights each token to align training with the speculative verification process, and it enables a single draft model to support both target-dependent and target-independent modes. Building on the parallel draft model of PARD, the method increases the consecutive acceptance length. Evaluated on Llama3.1-8B, PARD-2 achieves up to 6.94× lossless speedup, outperforming EAGLE-3 and PARD by 1.9× and 1.3×, respectively.
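To see why accepted length and per-token accuracy can diverge, the sketch below computes the expected number of consecutively accepted draft tokens from per-position acceptance probabilities. It is an illustration only, not the paper's objective: the function name `expected_accepted_length` is hypothetical, and it assumes that acceptance at each position is independent, which real draft models need not satisfy.

```python
from itertools import accumulate
from operator import mul


def expected_accepted_length(accept_probs):
    """Expected count of consecutively accepted draft tokens.

    accept_probs[i] is the (assumed independent) probability that the
    draft token at position i is accepted. Position i only counts if
    every earlier position was accepted too, so its contribution is
    the running product p_0 * p_1 * ... * p_i.
    """
    return sum(accumulate(accept_probs, mul))


# Two hypothetical drafts with the same mean per-token accuracy (0.8)
# but the weak position placed last versus first.
weak_last = [0.9, 0.9, 0.9, 0.5]
weak_first = [0.5, 0.9, 0.9, 0.9]

print(expected_accepted_length(weak_last))
print(expected_accepted_length(weak_first))
```

Under this assumption the draft whose weak position comes last yields roughly 2.80 expected accepted tokens, while the one whose weak position comes first yields roughly 1.72, even though both have the same average token accuracy. An early miss discards everything after it, which is the kind of mismatch an acceptance-length objective is meant to capture.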
πŸ“ Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94$\times$ lossless acceleration, surpassing EAGLE-3 by 1.9$\times$ and PARD by 1.3$\times$ on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.
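The draft-then-verify step described above can be sketched for the greedy-decoding case as follows. This is a generic illustration of speculative verification, not code from the PARD repository: the function names are hypothetical, tokens are plain integers, and `target_tokens` is assumed to hold the target model's greedy choice at each draft position plus one extra position, all obtained from a single parallel forward pass.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count how many leading draft tokens match the target's greedy choices."""
    n = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        n += 1
    return n


def speculative_step(draft_tokens, target_tokens):
    """Return the tokens emitted by one draft-then-verify step.

    Assumes len(target_tokens) == len(draft_tokens) + 1, where
    target_tokens[i] is the target's greedy token given the prefix
    followed by draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    n = accepted_prefix_length(draft_tokens, target_tokens)
    # Keep the accepted prefix, then append the target's own token:
    # a correction at the first mismatch, or a bonus token if all matched.
    return draft_tokens[:n] + [target_tokens[n]]


# The third draft token disagrees with the target, so two draft tokens
# are accepted and the target's token replaces the third.
print(speculative_step([5, 7, 9, 2], [5, 7, 3, 8, 1]))  # → [5, 7, 3]
```

Every emitted token is one the target model would have chosen itself, which is why greedy speculative decoding is lossless. The speedup comes from the accepted prefix length: each step costs one target forward pass but emits between 1 and `len(draft_tokens) + 1` tokens.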
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft model
token acceptance
LLM inference acceleration
training objective alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
draft model optimization
acceptance length maximization
dual-mode decoding
Confidence-Adaptive Token (CAT)