🤖 AI Summary
Large language models (LLMs) struggle to generate syntactically valid structured outputs, while existing grammar-constrained decoding methods incur prohibitive preprocessing overhead.
Method: We propose a novel context-free grammar (CFG)-guided constrained decoding algorithm that jointly models subword tokenization and CFG syntax via alignment-aware token masking. Our approach introduces dynamic mask generation and efficient offline reachability analysis, enabling tight coordination between offline preprocessing and online decoding.
Contribution/Results: The method accelerates preprocessing by 17.71×—reducing it from tens of minutes to under one minute—while maintaining theoretical soundness and state-of-the-art mask computation efficiency, with no measurable increase in online decoding latency. It supports flexible, user-defined grammars and generalizes across structured formats including programming languages, JSON, and XML. This work delivers a lightweight, universal, and high-fidelity syntactic constraint framework for controllable LLM generation.
📝 Abstract
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that would provably lead to outputs outside the language of a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms must determine how a given LLM's subword tokenizer can align with the terminals of a given CFG, and derive token masks from this alignment. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
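To make the masking idea concrete, here is a minimal sketch of GCD, not the paper's algorithm: at each decoding step, subword tokens are masked out if appending them to the output so far cannot be extended to a string in the grammar's language. The balanced-parentheses check is a toy stand-in for a real incremental CFG parser, and all names here are illustrative.

```python
def is_valid_prefix(text: str) -> bool:
    """Toy stand-in for a CFG prefix check: balanced parentheses.
    A real GCD system would run an incremental CFG parser here."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # closed more than opened: dead end
                return False
        else:
            return False       # character outside the grammar's alphabet
    return True                # any non-negative depth can still be completed

def token_mask(prefix: str, vocab: list[str]) -> list[bool]:
    """True = token allowed. Note that subword tokens may span several
    grammar terminals, which is why tokenizer/grammar alignment matters."""
    return [is_valid_prefix(prefix + tok) for tok in vocab]

vocab = ["(", ")", "((", "))", "()", "a"]
print(token_mask("(", vocab))  # "))" and "a" are masked out after "("
```

Computing such masks naively requires re-parsing the prefix against every vocabulary token at every step; the offline reachability analysis described above precomputes most of this work so that online decoding stays fast.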