🤖 AI Summary
Existing grammar-constrained speculative decoders sample from a locally projected distribution rather than the grammar-conditional distribution users usually intend, which introduces substantial bias: on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. This work extends the GAD impossibility result to speculative decoding and identifies the future-validity function Φ as the missing correction statistic. The target distribution is a Doob h-transform of the base model with h = Φ, whereas local masking corresponds to h = 1. With exact Φ, the oracle decoder FVO-Spec samples exactly from the target distribution; with approximate Φ, the paper bounds the resulting total-variation error. Because exact Φ is hard to compute for general context-free grammars, estimator hierarchies are evaluated on tractable Dyck and finite JSON languages: OneStep reduces Dyck total-variation distance by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.
📝 Abstract
Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejection, and rollback soundness samples from the locally projected distribution $μ^{\mathrm{proj}}$ rather than the grammar-conditional distribution $μ^\star$. This extends the GAD impossibility result to speculative decoding; on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. We identify the future-validity function $Φ_t(y)=\Pr_p[\mathrm{valid\ completion}\mid y]$ as the missing correction statistic. The target distribution is a Doob transform of the base model with $h=Φ$, while local masking corresponds to setting $h$ to one. With exact $Φ$, our oracle decoder FVO-Spec samples exactly from $μ^\star$; with approximate $Φ$, we bound the resulting total-variation error. Because exact future validity is hard for general context-free grammars, we evaluate estimator hierarchies on tractable Dyck and finite JSON languages. OneStep reduces Dyck TV by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.
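To make the distinction between $μ^{\mathrm{proj}}$ and $μ^\star$ concrete, the following is a minimal sketch on a toy problem, not the paper's implementation. It assumes a hypothetical context-free base model `P` over three tokens, a length-bounded Dyck language (`MAX_LEN` brackets at most, terminated by `$`), and illustrative helper names (`step`, `phi`, `sequence_law`). It computes $Φ$ exactly by dynamic programming, then builds the sequence law once with $h=Φ$ (the Doob transform) and once with $h$ equal to one on every live state (local masking), and measures the total-variation distance between the two.

```python
from functools import lru_cache

# Hypothetical toy base model: next-token probabilities that ignore context.
P = {"(": 0.5, ")": 0.3, "$": 0.2}  # "$" plays the role of end-of-sequence
MAX_LEN = 6                          # assumed bound on brackets before "$"


def step(state, tok):
    """Advance a (depth, remaining) state; return "DONE" or None (dead)."""
    depth, remaining = state
    if tok == "$":
        return "DONE" if depth == 0 else None
    if remaining == 0:
        return None
    if tok == "(":
        return (depth + 1, remaining - 1)
    return (depth - 1, remaining - 1) if depth > 0 else None


@lru_cache(maxsize=None)
def phi(state):
    """Future validity: probability under P of a valid completion from state."""
    if state == "DONE":
        return 1.0
    if state is None:
        return 0.0
    return sum(P[t] * phi(step(state, t)) for t in P)


def sequence_law(h):
    """Law over complete strings when each step samples proportional to P * h(next)."""
    law = {}

    def walk(state, prefix, prob):
        weights = {t: P[t] * h(step(state, t)) for t in P}
        z = sum(weights.values())
        for t, w in weights.items():
            if w == 0.0:
                continue
            nxt = step(state, t)
            if nxt == "DONE":
                law[prefix + t] = prob * w / z
            else:
                walk(nxt, prefix + t, prob * w / z)

    walk((0, MAX_LEN), "", 1.0)
    return law


# h = phi gives the grammar-conditional law; h = 1 on live states gives local masking.
star = sequence_law(phi)
proj = sequence_law(lambda s: 1.0 if phi(s) > 0.0 else 0.0)

tv = 0.5 * sum(abs(star.get(y, 0.0) - proj.get(y, 0.0)) for y in set(star) | set(proj))
print(f"TV(mu_star, mu_proj) on the toy problem: {tv:.4f}")
```

In this sketch the per-step normaliser under $h=Φ$ is $Φ$ of the current state, so the product telescopes and each valid string receives its base probability divided by $Φ$ at the start state, which is the conditional law. The masked variant renormalises over allowed tokens only, so it generally assigns different mass to the same strings.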