Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing locally constrained decoding (LCD) methods for controllable text generation suffer from two key bottlenecks: high computational cost from verifying the constraint on every token of a large vocabulary, and distortion of the global distribution over strings caused by myopic, locally optimal token choices that can lead down dead-end paths. This paper introduces an adaptive weighted rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations and, at small additional cost, produces low-variance, unbiased importance-weight estimates that can be soundly used within sequential Monte Carlo (SMC) to correct the bias of local constraint enforcement. The method's computational overhead scales with the divergence between the unconstrained and constrained LM, so runtime improvements grow as models improve. Empirically, it outperforms state-of-the-art baselines across text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON tasks, supporting a broader class of constraints while improving both runtime and accuracy.
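To make the core idea concrete, here is a minimal, hedged sketch of rejection-style constrained token sampling with an adaptive step: instead of scanning the whole vocabulary to build a mask, draw a token from the unconstrained distribution, check the constraint only on that token, and on rejection zero out its probability so it is never proposed again. The names `sample_token_rejection` and `constraint_ok`, and the toy even-token constraint, are illustrative assumptions, not the paper's actual API.

```python
import math
import random

def sample_token_rejection(logits, constraint_ok, temperature=1.0):
    """Sample a token satisfying `constraint_ok` without scanning the
    whole vocabulary: draw from the unconstrained next-token
    distribution, check only the drawn token, and zero it out on
    rejection (the adaptive step). Returns (token, n_constraint_checks).

    `logits` and `constraint_ok` are illustrative stand-ins for an
    LM's next-token scores and a per-token constraint checker."""
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    checked = 0
    while True:
        # random.choices normalizes weights internally, so zeroed
        # entries are simply never drawn again.
        tok = random.choices(range(len(probs)), weights=probs, k=1)[0]
        checked += 1
        if constraint_ok(tok):
            return tok, checked
        # Adaptive step: never re-propose a rejected token.
        probs[tok] = 0.0

# Toy example: 8-token vocabulary, only even token ids are allowed.
random.seed(0)
tok, n_checks = sample_token_rejection([0.0] * 8, lambda t: t % 2 == 0)
print(tok, n_checks)  # n_checks is at most 5, versus 8 for full masking
```

With half the vocabulary disallowed, this checks at most 5 tokens (the 4 odd tokens plus one accept) instead of all 8; with realistic vocabularies and constraints that most tokens satisfy, the gap is far larger.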

📝 Abstract
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
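The abstract's claim that rejection sampling yields unbiased importance-weight estimates rests on a standard fact: the ideal SMC weight at a step is the allowed probability mass $Z = \sum_{t\,\text{allowed}} p(t)$, and the indicator that a single unconstrained draw satisfies the constraint is an unbiased estimator of $Z$. The sketch below shows only this simple one-draw estimator as an assumption-laden illustration; the paper's adaptive estimator refines the idea to achieve much lower variance, and `weight_estimate_one_draw` is a hypothetical name, not the paper's interface.

```python
import random

def weight_estimate_one_draw(probs, constraint_ok):
    """Unbiased but high-variance estimate of the allowed mass
    Z = sum over allowed t of probs[t]: a single unconstrained draw
    is accepted with probability exactly Z, so the 0/1 acceptance
    indicator has expectation Z."""
    tok = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return 1.0 if constraint_ok(tok) else 0.0

# Monte Carlo check on a toy distribution where the allowed mass is 0.6.
random.seed(0)
probs = [0.1, 0.2, 0.3, 0.4]
allowed = lambda t: t in (1, 3)  # allowed mass = 0.2 + 0.4 = 0.6
est = sum(weight_estimate_one_draw(probs, allowed)
          for _ in range(20000)) / 20000
print(est)  # close to 0.6
```

Averaging many such indicator draws recovers $Z$, which is why weights built this way can be used soundly inside SMC; the variance reduction from the adaptive scheme is what makes the approach practical.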
Problem

Research questions and friction points this paper is trying to address.

Reduce constraint evaluation cost in language model generation
Correct global distribution distortion from local constraints
Improve runtime and performance for constrained generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive rejection sampling reduces constraint evaluations
Low-variance, unbiased importance-weight estimates for SMC correction
Dynamic computation scales with model divergence