Parallel Sampling via Autospeculation

📅 2025-11-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the sampling efficiency bottleneck in both any-order autoregressive models and denoising diffusion models. Methodologically, under a bounded-support assumption, it introduces an acceleration framework that combines parallel computation with speculative rejection sampling: it invokes conditional marginal oracles (for autoregressive models) or conditional mean oracles (for diffusion models) in parallel, and uses a "self-speculative" mechanism to construct an auxiliary distribution that enables sequence-level speculation and batched acceptance. Theoretically, it reduces the expected sampling time for both model classes from the standard $\widetilde{O}(n)$ to $\widetilde{O}(n^{1/2})$; for autoregressive models specifically, this improves on the prior $\widetilde{O}(n^{2/3})$ bound. It is also the first to achieve a parallel speedup for denoising diffusion models in the high-accuracy regime, thereby overcoming their inherently sequential sampling procedure.

📝 Abstract
We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary "speculative" distribution $\nu$ that approximates $\mu$ to accelerate sampling. Our technique is inspired by the well-studied "speculative decoding" techniques popular in large language models, but differs in key ways. Firstly, we use "autospeculation," namely we build the speculation $\nu$ out of the same oracle that defines $\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accurate "draft" model $\nu$. Secondly, the key differentiating factor in our technique is that we make and accept speculations at a "sequence" level rather than at the level of single (or a few) steps. This last fact is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.
Problem

Research questions and friction points this paper is trying to address.

Accelerating sampling from autoregressive and diffusion models
Reducing sequential sampling time through parallel oracle calls
Developing speculative rejection sampling for parallel sequence generation
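
For orientation, the sequential baseline that these questions start from can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a toy target distribution given as an explicit table, and the names `make_marginal_oracle` and `sequential_sample` are hypothetical. It shows only why the standard approach makes one oracle call per coordinate, so that $n$ coordinates cost $n$ dependent rounds.

```python
import random

def make_marginal_oracle(mu, q):
    """Build a brute-force conditional-marginal oracle (illustrative assumption).

    mu maps tuples in range(q)^n to probabilities. A real model would
    compute these marginals with a network rather than by enumeration.
    """
    def oracle(pinned, i):
        # pinned: dict {coordinate: value}. Returns the distribution of
        # coordinate i conditioned on the pinned coordinates.
        weights = [0.0] * q
        for x, p in mu.items():
            if all(x[j] == v for j, v in pinned.items()):
                weights[x[i]] += p
        total = sum(weights)
        return [w / total for w in weights]
    return oracle

def sequential_sample(oracle, n, q, order, rng):
    """Sample coordinates one at a time in the given order.

    Each oracle call depends on the values drawn before it, so the
    n calls cannot be issued in parallel in this baseline.
    """
    pinned = {}
    calls = 0
    for i in order:
        probs = oracle(pinned, i)
        calls += 1
        pinned[i] = rng.choices(range(q), weights=probs)[0]
    return tuple(pinned[i] for i in range(n)), calls

# Usage: a toy distribution on {0,1}^3 supported on two strings.
toy_mu = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
toy_oracle = make_marginal_oracle(toy_mu, q=2)
sample, calls = sequential_sample(toy_oracle, n=3, q=2, order=[2, 0, 1],
                                  rng=random.Random(0))
```

Because the order of coordinates is a free parameter, the same oracle serves any pinning order, which is what "any-order" refers to in the abstract.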
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel sampling via speculative rejection sampling
Autospeculation using same oracle for approximation
Sequence-level speculation enabling square-root runtime
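
For contrast with the sequence-level approach, the single-step acceptance rule commonly used in speculative decoding can be sketched as below. This is the standard token-level rule, not the paper's speculative rejection sampling, and the function name `speculative_step` and the argument names are hypothetical. It shows how a draft distribution (playing the role of $\nu$) can propose a value that is then accepted or corrected so that the output still follows the target (playing the role of $\mu$) exactly.

```python
import random

def speculative_step(target_probs, draft_probs, rng):
    """One token-level speculative sampling step (standard rule, for contrast).

    Returns (token, accepted). The returned token is distributed exactly
    according to target_probs, whatever draft_probs is.
    """
    k = len(target_probs)
    # Propose from the draft distribution.
    x = rng.choices(range(k), weights=draft_probs)[0]
    # Accept with probability min(1, target(x) / draft(x)).
    if rng.random() < min(1.0, target_probs[x] / draft_probs[x]):
        return x, True
    # On rejection, resample from the normalized residual max(target - draft, 0).
    residual = [max(t - d, 0.0) for t, d in zip(target_probs, draft_probs)]
    return rng.choices(range(k), weights=residual)[0], False

# Usage: a draft that only roughly matches the target.
rng = random.Random(0)
token, accepted = speculative_step([0.7, 0.2, 0.1], [0.3, 0.3, 0.4], rng)
```

The closer the draft is to the target, the more often the proposal is accepted; when the two coincide, every proposal is accepted.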