AI Summary
This work addresses the sampling efficiency bottleneck in both arbitrary-order autoregressive models and denoising diffusion models. Methodologically, under a bounded-support assumption, it introduces a novel acceleration framework leveraging parallel computation and speculative rejection sampling: it invokes conditional marginal (for autoregressive models) or conditional mean (for diffusion models) oracles in parallel, and employs a "self-speculative" mechanism to construct an auxiliary distribution enabling sequence-level speculation and batched acceptance. Theoretically, it is the first to reduce the expected sampling time for both model classes from the standard $\widetilde{O}(n)$ to $\widetilde{O}(n^{1/2})$; for autoregressive models specifically, it improves upon prior $\widetilde{O}(n^{2/3})$ bounds. Crucially, it is also the first to achieve substantial parallel speedup for denoising diffusion models under high-precision requirements, thereby overcoming their inherent sequential constraint.
Abstract
We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary ``speculative'' distribution $\nu$ that approximates $\mu$ to accelerate sampling. Our technique is inspired by the well-studied ``speculative decoding'' techniques popular in large language models, but differs in key ways. First, we use ``autospeculation'': we build the speculation $\nu$ out of the same oracle that defines $\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accurate ``draft'' model $\nu$. Second, we make and accept speculations at the level of whole sequences rather than at the level of single (or a few) steps. This sequence-level acceptance is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.