🤖 AI Summary
Structured decoding (e.g., for HTML or JSON generation) faces efficiency bottlenecks in grammar compilation, state tracking, and mask construction. To address this, the authors propose a constraint-decomposition framework grounded in prior structural knowledge: syntax constraints are split into static, pre-compilable components and dynamic runtime parameters, and compositional regular operators replace traditional pushdown automata to reduce state-transition overhead. Combined with grammar-fragment-driven constraint decomposition, domain-aware simplification, and mask caching, this yields a lightweight decoding engine. Experiments show up to a 250× decoding speedup over existing structured decoding systems while preserving generation correctness. The implementation is open-sourced.
📝 Abstract
Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.
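The static/dynamic decomposition and mask caching can be sketched roughly as follows. This is a minimal toy illustration, not wgrammar's actual API: the vocabulary, the `compile_states` / `mask_for` helpers, and the skeleton representation are all hypothetical, chosen only to show the idea of precompiling a fixed structural skeleton offline while filling dynamic slots and reusing cached token masks at runtime.

```python
# Toy sketch of static/dynamic constraint decomposition with mask caching.
# All names here are illustrative assumptions, not wgrammar's real interface.
from functools import lru_cache

VOCAB = ["{", "}", '"name"', '"age"', ":", ",", '"Alice"', "30"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Static skeleton: the output structure known before decoding starts,
# compiled once offline. `None` marks a dynamic slot filled at runtime.
STATIC_SKELETON = ["{", '"name"', ":", None, ",", '"age"', ":", None, "}"]

def compile_states(skeleton, dynamic_allowed):
    """Return, per decoding step, the frozenset of allowed token ids.
    `None` slots take their allowed set from the runtime arguments."""
    states, slot = [], 0
    for item in skeleton:
        if item is None:
            states.append(frozenset(TOKEN_ID[t] for t in dynamic_allowed[slot]))
            slot += 1
        else:
            states.append(frozenset({TOKEN_ID[item]}))
    return states

@lru_cache(maxsize=None)
def mask_for(allowed):
    """Boolean vocabulary mask for one state; cached so that repeated
    states (e.g. the ':' separator) pay no rebuild cost."""
    return tuple(i in allowed for i in range(len(VOCAB)))

# Runtime: instantiate the two dynamic slots with concrete value sets.
states = compile_states(STATIC_SKELETON, [['"Alice"'], ["30"]])
masks = [mask_for(s) for s in states]
```

During decoding, each step's mask would be applied to the model's logits so only structurally valid tokens remain; since the `":"` state occurs twice in the skeleton, the second occurrence is served from the cache rather than rebuilt.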