🤖 AI Summary
Large language models (LLMs) struggle to generate syntactically valid structured outputs, while existing grammar-constrained decoding methods incur prohibitive preprocessing overhead.
Method: We propose a novel context-free grammar (CFG)-guided constrained decoding algorithm that jointly models subword tokenization and CFG syntax via alignment-aware token masking. Our approach introduces dynamic mask generation and efficient offline reachability analysis, enabling tight coordination between offline preprocessing and online decoding.
Contribution/Results: The method accelerates preprocessing by 17.71×—reducing it from tens of minutes to under one minute—while maintaining theoretical soundness and state-of-the-art mask computation efficiency, with no measurable increase in online decoding latency. It supports flexible, user-defined grammars and generalizes across structured formats including programming languages, JSON, and XML. This work delivers a lightweight, universal, and high-fidelity syntactic constraint framework for controllable LLM generation.
📝 Abstract
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that would provably lead to outputs outside the language of a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms must determine how a given LLM's subword tokenizer can align with the terminals of a given CFG, and derive token masks from this alignment. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
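To make the masking idea concrete, here is a minimal sketch of GCD, not the paper's algorithm: at each decoding step, subword tokens are masked out if appending them to the output so far cannot be extended to a string in the grammar's language. The balanced-parentheses check is a toy stand-in for a real incremental CFG parser, and all names here are illustrative.

```python
def is_valid_prefix(text: str) -> bool:
    """Toy stand-in for a CFG prefix check: balanced parentheses.
    A real GCD system would run an incremental CFG parser here."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # closed more than opened: dead end
                return False
        else:
            return False       # character outside the grammar's alphabet
    return True                # any non-negative depth can still be completed

def token_mask(prefix: str, vocab: list[str]) -> list[bool]:
    """True = token allowed. Note that subword tokens may span several
    grammar terminals, which is why tokenizer/grammar alignment matters."""
    return [is_valid_prefix(prefix + tok) for tok in vocab]

vocab = ["(", ")", "((", "))", "()", "a"]
print(token_mask("(", vocab))  # "))" and "a" are masked out after "("
```

Computing such masks naively requires re-parsing the prefix against every vocabulary token at every step; the offline reachability analysis described above precomputes most of this work so that online decoding stays fast.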