Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

📅 2024-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a core problem in multi-draft speculative sampling: how to select, at the token level, an output token from candidates sampled independently from different draft models, such that the output distribution exactly matches that of the target model. The optimal selection scheme, previously cast as the solution to a linear program, is shown to decompose into two steps: an importance sampling (IS) type scheme first selects one intermediate token, and single-draft speculative sampling is then applied to generate the output token. For the case of two identical draft models, we establish a necessary and sufficient condition on the target and draft distributions for the acceptance probability to equal one, and provide an explicit expression for the optimal acceptance probability. The analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Experiments show consistent improvements in block efficiency and token rates over baseline schemes in a number of scenarios, while the output distribution still matches the target model.

📝 Abstract
We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
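The two-step structure described in the abstract can be illustrated with a small sketch. It assumes two identical draft models over a three-token vocabulary with strictly positive draft probabilities. The step-1 rule used here (keep the candidate with the larger target/draft ratio) is a hypothetical stand-in chosen for simplicity, not the optimal IS scheme from the paper; all function names are illustrative. Step 2 is standard single-draft speculative sampling, run against the distribution that step 1 induces on the selected token.

```python
import random

def select_candidate(candidates, draft, target):
    # Hypothetical step-1 rule (an assumption, not the paper's scheme):
    # keep the candidate with the largest target/draft ratio; ties go to the first.
    return max(candidates, key=lambda t: target[t] / draft[t])

def induced_distribution(draft, target):
    # Exact distribution of the selected token when two candidates are drawn
    # independently from the same draft distribution.
    n = len(draft)
    induced = [0.0] * n
    for a in range(n):
        for b in range(n):
            chosen = select_candidate((a, b), draft, target)
            induced[chosen] += draft[a] * draft[b]
    return induced

def acceptance_probability(proposal, target):
    # Probability that single-draft speculative sampling keeps the proposed token.
    return sum(min(p, q) for p, q in zip(proposal, target))

def speculative_step(token, proposal, target, rng):
    # Accept the proposed token with probability min(1, target/proposal);
    # otherwise resample from the normalized residual max(target - proposal, 0).
    if rng.random() < min(1.0, target[token] / proposal[token]):
        return token
    residual = [max(q - p, 0.0) for p, q in zip(proposal, target)]
    return rng.choices(range(len(target)), weights=residual)[0]

def two_step_sample(draft, target, induced, rng):
    # Step 1: draw two candidates and select one intermediate token.
    candidates = rng.choices(range(len(draft)), weights=draft, k=2)
    chosen = select_candidate(candidates, draft, target)
    # Step 2: verify the intermediate token against the target distribution.
    return speculative_step(chosen, induced, target, rng)

draft = [0.5, 0.3, 0.2]    # toy draft distribution (assumed values)
target = [0.2, 0.3, 0.5]   # toy target distribution (assumed values)
induced = induced_distribution(draft, target)
single_accept = acceptance_probability(draft, target)
two_step_accept = acceptance_probability(induced, target)
```

Because step 2 uses the induced distribution as its proposal, the output follows the target distribution for any deterministic step-1 rule; the rule only changes how often the selected token is accepted. On these toy numbers the acceptance probability with two candidates is higher than with a single draft, but that is a property of this example and not a general guarantee for this rule.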
Problem

Research questions and friction points this paper is trying to address.

Optimize token-level draft selection in multi-draft sampling
Decompose the optimal scheme into importance sampling followed by single-draft speculative sampling
Establish when the acceptance probability equals one for two identical draft models
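
For context on the last point, the single-draft case has a well-known answer, sketched below. The notation is an assumption introduced here: p is the draft distribution, q is the target distribution, and a token x drawn from p is accepted with probability min(1, q(x)/p(x)).

```latex
\Pr[\text{accept}]
  = \sum_{x} p(x)\,\min\!\left(1, \frac{q(x)}{p(x)}\right)
  = \sum_{x} \min\bigl(p(x),\, q(x)\bigr)
  = 1 - d_{\mathrm{TV}}(p, q)
```

With a single draft, the acceptance probability therefore equals one only when p = q. The condition for two identical draft models is derived in the paper and is not reproduced here.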
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-draft speculative sampling with independent proposal sequences
Two-step optimal scheme: importance sampling and speculative sampling
Weighted importance sampling for improved token selection