🤖 AI Summary
Structured decoding (e.g., for HTML or JSON generation) faces efficiency bottlenecks in grammar compilation, state tracking, and mask construction. To address this, the authors propose a constraint-decomposition framework grounded in prior structural knowledge: syntax constraints are split into static, pre-compilable components and dynamic runtime parameters, and compositional regular operators replace traditional pushdown automata to reduce state-transition overhead. Combined with grammar-fragment-driven constraint decomposition, domain-aware simplification, and mask caching, this yields a lightweight decoding engine. Experiments show up to a 250× decoding speedup over existing structured decoding systems while preserving generation correctness. The implementation is open-sourced.
📝 Abstract
Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.
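The static/dynamic decomposition and mask caching can be sketched roughly as follows. This is a minimal toy illustration, not wgrammar's actual API: the vocabulary, the `compile_states` / `mask_for` helpers, and the skeleton representation are all hypothetical, chosen only to show the idea of precompiling a fixed structural skeleton offline while filling dynamic slots and reusing cached token masks at runtime.

```python
# Toy sketch of static/dynamic constraint decomposition with mask caching.
# All names here are illustrative assumptions, not wgrammar's real interface.
from functools import lru_cache

VOCAB = ["{", "}", '"name"', '"age"', ":", ",", '"Alice"', "30"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Static skeleton: the output structure known before decoding starts,
# compiled once offline. `None` marks a dynamic slot filled at runtime.
STATIC_SKELETON = ["{", '"name"', ":", None, ",", '"age"', ":", None, "}"]

def compile_states(skeleton, dynamic_allowed):
    """Return, per decoding step, the frozenset of allowed token ids.
    `None` slots take their allowed set from the runtime arguments."""
    states, slot = [], 0
    for item in skeleton:
        if item is None:
            states.append(frozenset(TOKEN_ID[t] for t in dynamic_allowed[slot]))
            slot += 1
        else:
            states.append(frozenset({TOKEN_ID[item]}))
    return states

@lru_cache(maxsize=None)
def mask_for(allowed):
    """Boolean vocabulary mask for one state; cached so that repeated
    states (e.g. the ':' separator) pay no rebuild cost."""
    return tuple(i in allowed for i in range(len(VOCAB)))

# Runtime: instantiate the two dynamic slots with concrete value sets.
states = compile_states(STATIC_SKELETON, [['"Alice"'], ["30"]])
masks = [mask_for(s) for s in states]
```

During decoding, each step's mask would be applied to the model's logits so only structurally valid tokens remain; since the `":"` state occurs twice in the skeleton, the second occurrence is served from the cache rather than rebuilt.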