Parallel Sampling via Autospeculation

๐Ÿ“… 2025-11-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the sampling efficiency bottleneck in both arbitrary-order autoregressive models and denoising diffusion models. Methodologically, under a bounded-support assumption, it introduces a novel acceleration framework leveraging parallel computation and speculative rejection sampling: it invokes conditional marginal (for autoregressive models) or conditional mean (for diffusion models) oracles in parallel, and employs a "self-speculative" mechanism to construct an auxiliary distribution enabling sequence-level speculation and batched acceptance. Theoretically, it is the first to reduce the expected sampling time for both model classes from the standard $\widetilde{O}(n)$ to $\widetilde{O}(n^{1/2})$; for autoregressive models specifically, it improves upon prior $\widetilde{O}(n^{2/3})$ bounds. Crucially, it is also the first to achieve substantial parallel speedup for denoising diffusion models under high-precision requirements, thereby overcoming their inherent sequential constraint.

๐Ÿ“ Abstract
We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary "speculative" distribution $\nu$ that approximates $\mu$ to accelerate sampling. Our technique is inspired by the well-studied "speculative decoding" techniques popular in large language models, but differs in key ways. Firstly, we use "autospeculation," namely we build the speculation $\nu$ out of the same oracle that defines $\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accurate "draft" model $\nu$. Secondly, the key differentiating factor in our technique is that we make and accept speculations at a "sequence" level rather than at the level of single (or a few) steps. This last fact is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.
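To make the "sequence-level" acceptance idea concrete, here is a minimal toy sketch of plain rejection sampling where an entire sequence is proposed and accepted or rejected in one shot. This is not the paper's algorithm (which builds $\nu$ from the same oracle as $\mu$ and batches acceptance across parallel oracle calls); the toy target over $\{0,1\}^3$, the uniform speculative distribution, and the envelope constant `M` are all illustrative assumptions.

```python
import itertools
import random

n = 3
seqs = list(itertools.product([0, 1], repeat=n))

# Toy target mu: unnormalized weight 2**(number of ones), then normalized.
weights = {s: 2.0 ** sum(s) for s in seqs}
Z = sum(weights.values())
mu = {s: w / Z for s, w in weights.items()}

# Speculative distribution nu: uniform over all sequences (a crude stand-in
# for the draft distribution; the paper constructs nu from mu's own oracle).
nu = {s: 1.0 / len(seqs) for s in seqs}

# Envelope constant M >= sup_x mu(x) / nu(x), required for rejection sampling.
M = max(mu[s] / nu[s] for s in seqs)

def rejection_sample():
    """Propose a whole sequence x ~ nu and accept it with probability
    mu(x) / (M * nu(x)); on rejection, retry with a fresh proposal."""
    while True:
        x = random.choice(seqs)  # one-shot, sequence-level proposal from nu
        if random.random() < mu[x] / (M * nu[x]):
            return x
```

The closer $\nu$ tracks $\mu$, the smaller `M` is and the fewer retries are needed; the paper's contribution is constructing such a $\nu$ from the same conditional-marginal (or conditional-mean) oracle and exploiting parallelism so the expected work is $\widetilde{O}(n^{1/2})$.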
Problem

Research questions and friction points this paper is trying to address.

Accelerating sampling from autoregressive and diffusion models
Reducing sequential sampling time through parallel oracle calls
Developing speculative rejection sampling for parallel sequence generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel sampling via speculative rejection sampling
Autospeculation using same oracle for approximation
Sequence-level speculation enabling square-root runtime
๐Ÿ”Ž Similar Papers
No similar papers found.
Nima Anari
Stanford University
AlgorithmsProbability TheoryTheoretical Computer Science
Carlo Baronio
Stanford University
CJ Chen
University of Arizona
Alireza Haqi
Stanford University
Frederic Koehler
University of Chicago
Theoretical Computer ScienceMachine LearningHigh-Dimensional Statistics
Anqi Li
Stanford University
T. Vuong
UC Berkeley