D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

📅 2026-05-12

🤖 AI Summary

This work addresses a limitation of multi-token draft models in parallel speculative decoding: their training objectives typically use fixed position-dependent weighting schedules, which do not adapt as the positions that limit acceptance change during training. The authors propose D-PACE, a dynamic position-aware cross-entropy loss that derives per-position training weights from a differentiable surrogate of expected accepted draft length, so that each position is weighted by its log-probability gradient contribution to that surrogate. This concentrates the training signal on the positions that currently limit acceptance, and it requires no changes to the drafter architecture or the inference procedure. Across six benchmarks, the method consistently improves wall-clock speedup and average emitted length, with a measured training-time overhead of 2.3%.

📝 Abstract

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.
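
The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact surrogate, so the form below is an assumption rather than the authors' definition: it treats the drafter's probability of the target token at position i, p[i], as a stand-in for the acceptance probability, and uses L = sum_k prod_{j<=k} p[j] as the expected accepted length. The gradient of this L with respect to log p[i] is sum_{k>=i} prod_{j<=k} p[j], which the sketch normalizes and uses as a constant weight on each position's cross-entropy term. The names `dpace_weights` and `dpace_loss` are hypothetical.

```python
import numpy as np

def dpace_weights(p):
    """Per-position weights from an ASSUMED surrogate of expected accepted length.

    p[i]: drafter probability of the target token at block position i
          (a hypothetical proxy for the acceptance probability).
    Surrogate:  L = sum_k prod_{j<=k} p[j]
    Gradient:   dL / d log p[i] = sum_{k>=i} prod_{j<=k} p[j]
    """
    p = np.asarray(p, dtype=float)
    prefix = np.cumprod(p)                # prefix[k] = prob. that positions 0..k are all accepted
    grad = np.cumsum(prefix[::-1])[::-1]  # suffix sums of the prefix products
    return grad / grad.sum()              # normalized; treated as constants, not differentiated

def dpace_loss(p):
    """Weighted cross-entropy over one block, using the weights above."""
    p = np.asarray(p, dtype=float)
    w = dpace_weights(p)
    return float(-(w * np.log(p)).sum())

# Usage: a weak drafter versus a strong one over a 3-token block.
weak = dpace_weights([0.5, 0.5, 0.5])
strong = dpace_weights([0.99, 0.99, 0.99])
print(weak, strong)
```

Under this assumed surrogate the weights are recomputed from the current probabilities at every step, so they move with the drafter: when early positions are unreliable, most of the weight sits on them, and as those probabilities approach 1 a larger share moves to later positions. A fixed block-position decay would keep the same schedule throughout training.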

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

parallel drafting

position-aware weighting

multi-token drafter

training objective

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

dynamic weighting

parallel drafting