🤖 AI Summary
This work addresses the growing demand for efficient and flexible structured generation in dynamic tasks—such as tool calling—for large language model (LLM) agents. To this end, we propose XGrammar 2, a high-performance structured generation engine that substantially reduces the overhead of dynamic structured output through several key innovations: a novel TagDispatch mechanism for dynamic semantic dispatch, just-in-time (JIT) compilation, cross-grammar caching, an Earley-parser-based mask generation algorithm, and compression techniques for repetitive structures. Experimental results demonstrate that XGrammar 2 achieves over a 6× speedup compared to existing engines while introducing negligible latency when integrated into LLM inference pipelines, offering an efficient and low-overhead solution for dynamic structured generation.
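The summary names TagDispatch as a mechanism for dynamic semantic dispatch, e.g. for tool calling. The general idea of tag-triggered dispatch can be sketched as follows; this is a minimal illustration under assumed semantics, not XGrammar 2's actual interface, and the tag names and structure check are hypothetical:

```python
# Hypothetical sketch of tag-based dynamic dispatch: free-form text is
# unconstrained until a trigger tag appears, after which the span up to
# the closing tag must match that tag's structure. The tag and the
# regex stand-in for a per-tag grammar are illustrative only.
import re

DISPATCH = {
    "<tool_call>": re.compile(r'\{"name": "\w+"\}'),  # stand-in for a JSON grammar
}

def validate_stream(text: str) -> bool:
    """Accept free text; validate every tagged span against the
    structure registered for its trigger tag."""
    for tag, pattern in DISPATCH.items():
        end_tag = tag.replace("<", "</", 1)
        start = text.find(tag)
        while start != -1:
            end = text.find(end_tag, start)
            if end == -1:                      # unclosed tagged span
                return False
            body = text[start + len(tag):end]
            if not pattern.fullmatch(body):    # structured check inside the span
                return False
            start = text.find(tag, end + len(end_tag))
    return True

ok = validate_stream('Sure! <tool_call>{"name": "search"}</tool_call> done.')
bad = validate_stream('<tool_call>oops</tool_call>')
```

Here the surrounding prose is unconstrained while each `<tool_call>` span is checked structurally, which mirrors the dispatch-on-tag behavior the summary describes.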
📝 Abstract
Modern LLM agents are required to handle increasingly complex structured generation tasks, such as tool calling and conditional structured generation. These tasks are significantly more dynamic than predefined structures, posing new challenges to current structured generation engines. In this paper, we propose XGrammar 2, a highly optimized structured generation engine for agentic LLMs. XGrammar 2 accelerates mask generation for these dynamic structured generation tasks through a new dynamic dispatching semantics, TagDispatch. We further introduce a just-in-time (JIT) compilation method to reduce compilation time and a cross-grammar caching mechanism that exploits common sub-structures across different grammars. Additionally, we extend the previous pushdown-automaton (PDA)-based mask generation algorithm to an Earley-parser-based one and design a repetition compression algorithm to handle repetitive structures in grammars. Evaluation results show that XGrammar 2 achieves more than a 6× speedup over existing structured generation engines. Integrated with an LLM inference engine, XGrammar 2 handles dynamic structured generation tasks with near-zero overhead.
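The mask generation the abstract refers to is the step common to structured generation engines: at each decoding step, vocabulary entries that cannot legally continue the output are masked out. A toy sketch of this idea, with a hypothetical character-level vocabulary and a trivial "grammar" accepting only `true` or `false` (not XGrammar 2's algorithm or API):

```python
# Toy illustration of mask-based constrained decoding. The "grammar"
# accepts exactly the strings "true" and "false"; at each step we build
# a boolean mask over a tiny character vocabulary and pick the highest-
# scoring allowed token. All names here are hypothetical.
VOCAB = ["t", "r", "u", "e", "f", "a", "l", "s", "{", "}"]
TARGETS = ["true", "false"]

def allowed_mask(prefix: str) -> list[bool]:
    """True where appending the token keeps the output a valid prefix
    of some accepted string."""
    return [any(t.startswith(prefix + tok) for t in TARGETS) for tok in VOCAB]

def greedy_constrained(logits_fn, max_steps: int = 8) -> str:
    """Greedy decode, suppressing disallowed tokens at every step."""
    out = ""
    for _ in range(max_steps):
        if out in TARGETS:                      # accepted string: stop
            return out
        mask = allowed_mask(out)
        scores = logits_fn(out)
        best, best_score = None, float("-inf")
        for i, tok in enumerate(VOCAB):
            if mask[i] and scores[i] > best_score:
                best, best_score = tok, scores[i]
        out += best
    return out

# A "model" that always prefers 'f': the mask still forces a legal string.
result = greedy_constrained(lambda prefix: [1.0 if v == "f" else 0.0 for v in VOCAB])
# → "false"
```

In a real engine the mask is computed over a tokenizer vocabulary of ~100K multi-character tokens against a context-free grammar, which is why efficient mask generation (e.g. the Earley-parser-based algorithm above) is the performance bottleneck being optimized.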