Parallel Sampling via Autospeculation

📅 2025-11-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sampling-time bottleneck in both any-order autoregressive models and denoising diffusion models. It introduces an acceleration technique, speculative rejection sampling, that issues oracle calls in parallel: conditional-marginal queries for autoregressive models and conditional-mean queries for diffusion models. A "self-speculative" (autospeculation) mechanism builds the auxiliary speculative distribution from the same oracle that defines the target, so that speculations are made and accepted at the level of whole sequences rather than single steps. The expected sampling time in both settings drops from the standard $\widetilde{O}(n)$ to $\widetilde{O}(n^{1/2})$. For any-order autoregressive models this improves on the previous $\widetilde{O}(n^{2/3})$ bound. For denoising diffusion models it is the first parallel speedup in the high-accuracy regime, obtained under the assumption that the support of the target distribution is bounded.

📝 Abstract

We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary "speculative" distribution $\nu$ that approximates $\mu$ to accelerate sampling. Our technique is inspired by the well-studied "speculative decoding" techniques popular in large language models, but differs in key ways. Firstly, we use "autospeculation," namely we build the speculation $\nu$ out of the same oracle that defines $\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accurate "draft" model $\nu$. Secondly, the key differentiating factor in our technique is that we make and accept speculations at a "sequence" level rather than at the level of single (or a few) steps. This last fact is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.
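To make the sequential baseline concrete, here is a minimal Python sketch of sampling from an any-order autoregressive model one coordinate at a time. It illustrates only the standard sequential procedure that the abstract describes as the starting point, not the parallel algorithm. The oracle signature, the function names, and the toy target distribution are assumptions introduced for illustration and do not come from the paper.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical oracle signature (an assumption, not from the paper): given the
# coordinates pinned so far and a free index i, return the conditional
# marginal of x_i as a list of q probabilities.
MarginalOracle = Callable[[Dict[int, int], int], Sequence[float]]


def sequential_sample(
    oracle: MarginalOracle, n: int, order: Sequence[int], rng: random.Random
) -> Tuple[List[int], int]:
    """Pin one coordinate per oracle call, visiting coordinates in `order`."""
    pinned: Dict[int, int] = {}
    calls = 0
    for i in order:
        probs = oracle(pinned, i)  # each call depends on all earlier draws
        calls += 1
        pinned[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [pinned[i] for i in range(n)], calls


def toy_oracle(pinned: Dict[int, int], i: int) -> Sequence[float]:
    """Toy target on {0,1}^n: uniform over the all-zeros and all-ones strings."""
    if not pinned:
        return [0.5, 0.5]
    value = next(iter(pinned.values()))  # every pinned coordinate agrees
    return [1.0, 0.0] if value == 0 else [0.0, 1.0]


rng = random.Random(0)
n = 8
order = list(range(n))
rng.shuffle(order)  # "any-order": the visiting order is arbitrary
sample, calls = sequential_sample(toy_oracle, n, order, rng)
print(sample, calls)
```

Each oracle call here depends on the outcomes of all earlier calls, so the loop uses n adaptive rounds. That sequential dependence is what the parallel algorithms in the paper are designed to shorten.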

Problem

Research questions and friction points this paper is trying to address.

Accelerating sampling from autoregressive and diffusion models

Reducing sequential sampling time through parallel oracle calls

Developing speculative rejection sampling for parallel sequence generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel sampling via speculative rejection sampling

Autospeculation: the speculative distribution is built from the same oracle that defines the target

Sequence-level speculation enabling square-root runtime
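For background on the speculative decoding idea that the abstract cites as inspiration, the following Python sketch shows the standard single-step accept/reject rule: propose from a draft distribution q, accept with probability min(1, p(x)/q(x)), and otherwise resample from the normalized positive part of p - q. This is the familiar step-level rule, not the paper's sequence-level speculative rejection sampling, and the function names and example distributions are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def speculative_step(p: Sequence[float], q: Sequence[float], rng: random.Random) -> int:
    """Draw one symbol distributed as p, using a proposal drawn from q."""
    x = rng.choices(range(len(q)), weights=q, k=1)[0]
    # q[x] > 0 here because x was drawn from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # speculation accepted
    # Rejected: resample from the residual max(p - q, 0), renormalized.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def output_law(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact distribution of the value returned by speculative_step."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accepted)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p == q, so rejection never happens
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]  # target distribution (illustrative values)
q = [0.2, 0.5, 0.3]  # draft distribution (illustrative values)
law = output_law(p, q)
draw = speculative_step(p, q, random.Random(1))
print(law, draw)
```

The helper `output_law` sums the accept and reject branches in closed form, which makes it easy to check that the rule returns a draw from p regardless of how poorly q approximates it. A worse q only lowers the acceptance rate.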

🔎 Similar Papers

Multiple importance sampling for stochastic gradient estimation