Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional pushdown automaton (PDA)-based constrained decoding for large language models (LLMs) incurs high runtime overhead during batched inference when generating structured outputs (e.g., JSON) under LR(1) grammar constraints, because transition paths are explored dynamically at inference time. Method: This paper proposes a deterministic pushdown automaton (DPDA)-based constrained decoding framework, Pre$^3$. It introduces a novel, lossless conversion algorithm from LR(1) parsing tables to DPDAs, augmented with prefix-conditioned edge precomputation and state compression to enable parallel, exploration-free transition decisions. Contribution/Results: The method eliminates runtime context-dependent token processing and integrates seamlessly into mainstream LLM inference frameworks. Experiments on standard JSON generation tasks show up to a 40% reduction in time per output token (TPOT) and up to a 36% throughput improvement, demonstrating significant efficiency gains without compromising output correctness.
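To make the integration point concrete: constrained decoding frameworks of this kind typically apply a precomputed per-state token mask to the model's logits at every step. The sketch below is illustrative only and is not Pre$^3$'s API; `constrained_step` and `allowed_ids` are hypothetical names.

```python
import math

def constrained_step(logits, allowed_ids):
    """Mask grammar-violating tokens before sampling.

    logits: list[float] over the vocabulary.
    allowed_ids: set of token ids the automaton permits in its current
    state (precomputed ahead of time, so this is a pure lookup/mask step).
    Returns the greedily selected token id among the allowed set.
    """
    masked = [
        logit if i in allowed_ids else -math.inf
        for i, logit in enumerate(logits)
    ]
    return max(range(len(masked)), key=lambda i: masked[i])
```

Because the allowed set is known before inference, the per-step cost is a single mask application per sequence, which batches trivially.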

📝 Abstract
Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, especially inefficient under large inference batches. To address these issues, we propose Pre$^3$ that exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during the preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
Problem

Research questions and friction points this paper is trying to address.

Optimizing structured LLM generation under LR(1) grammar constraints
Reducing runtime overhead from context-dependent token processing
Converting LR(1) transition graphs into deterministic pushdown automata (DPDAs)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Precomputes prefix-conditioned edges for ahead-of-time analysis
Transforms LR(1) graphs into deterministic pushdown automata
Reduces runtime overhead with parallel transition processing
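The key property behind these contributions is determinism: once the automaton is deterministic, every (state, token) pair resolves to exactly one transition, so decoding needs no runtime path exploration. The toy DPDA below, for a non-empty nested-list grammar over symbolic tokens, is a hedged sketch of that idea, not the paper's conversion algorithm; all names here are hypothetical.

```python
# Precomputed transition table: (state, token) -> (next_state, stack_action).
# Determinism means each step is one dict lookup -- no backtracking.
TABLE = {
    ("expect_value", "["): ("expect_value", "push"),
    ("expect_value", "x"): ("after_value", None),
    ("after_value", ","): ("expect_value", None),
    ("after_value", "]"): ("after_value", "pop"),
}

def allowed_tokens(state, stack):
    """Tokens permitted next -- the set used to mask LLM logits."""
    ok = []
    for (s, tok), (_, action) in TABLE.items():
        if s != state:
            continue
        if action == "pop" and not stack:
            continue  # cannot close a bracket that was never opened
        ok.append(tok)
    return sorted(ok)

def run(tokens):
    """Accept iff tokens form a complete non-empty nested list of 'x' items."""
    state, stack = "expect_value", []
    for tok in tokens:
        if (state, tok) not in TABLE:
            return False
        next_state, action = TABLE[(state, tok)]
        if action == "push":
            stack.append("[")
        elif action == "pop":
            if not stack:
                return False
            stack.pop()
        state = next_state
    return state == "after_value" and not stack
```

A real system operates over the tokenizer vocabulary rather than symbolic terminals, and the paper's prefix-conditioned edges handle the cases where a transition's validity depends on stack context; the sketch only conveys why deterministic transitions make the per-token cost constant.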
👥 Authors

Junyi Chen
Shanghai Jiao Tong University
Generative AI · Multimodal Learning

Shihao Bai
Sensetime Research

Zaijun Wang
Sensetime Research

Siyu Wu
Beihang University

Chuheng Du
Shanghai Jiao Tong University

Hailong Yang
Beihang University

Ruihao Gong
Beihang University, Sensetime Research

Shengzhong Liu
Shanghai Jiao Tong University

Fan Wu
Shanghai Jiao Tong University

Guihai Chen
Professor of Computer Science, Computer Science and Technology