Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

📅 2026-04-20

🤖 AI Summary

Large language models typically handle editing tasks by autoregressively regenerating the entire output, even when most tokens appear verbatim in the input. This work proposes Copy-as-Decode, a structured decoding mechanism that formulates an edit as a sequence of <copy> and <gen> primitives. A token-level finite-state machine guarantees syntactic validity, and each copy span updates the KV cache in a single parallel-prefill forward pass instead of N autoregressive steps. The reported results are an upper-bound analysis that requires no end-to-end training. On Qwen2.5-{1.5B, 7B}, copying N tokens via parallel prefill is 6.8–303× faster than autoregressive decoding for N between 8 and 512. On ProbeEdit and HumanEvalPack-Fix, 74%–98% of gold tokens are reachable with the line-level primitive, giving closed-form wall-clock bounds of 29.0×, 3.4×, and 4.2× (13.0× pooled). Oracle programs round-trip losslessly through the deterministic resolver on all 482 cases.
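To make the grammar constraint concrete, here is a minimal sketch of a finite-state machine over the two primitives. It is illustrative only: it works on coarse symbol classes rather than model tokens, and the state names, symbol names, and the `allowed` and `accepts` helpers are assumptions made for this example, not the paper's implementation.

```python
# Coarse symbol classes (assumed for illustration). A real token-level FSM
# would also track partial progress inside each tag.
COPY, GEN_OPEN, GEN_CLOSE, TEXT, EOS = "copy", "gen_open", "gen_close", "text", "eos"

# state -> {symbol: next_state}; any symbol not listed is masked out.
TRANSITIONS = {
    "top": {COPY: "top", GEN_OPEN: "in_gen", EOS: "done"},
    "in_gen": {TEXT: "in_gen", GEN_CLOSE: "top"},
    "done": {},
}


def allowed(state):
    """Symbols a constrained decoder would leave unmasked in this state."""
    return set(TRANSITIONS[state])


def accepts(symbols):
    """Return True if the symbol sequence is a complete, valid program."""
    state = "top"
    for symbol in symbols:
        if symbol not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][symbol]
    return state == "done"


valid = accepts([COPY, GEN_OPEN, TEXT, GEN_CLOSE, COPY, EOS])
invalid = accepts([GEN_OPEN, COPY])  # a copy may not appear inside <gen>
```

During decoding, a mask built from `allowed(state)` would be applied to the logits at every step, so an invalid program can never be sampled.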

📝 Abstract

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, and <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward pass rather than $N$ autoregressive steps. This shares the parallel-forward kernel of speculative decoding, but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive decoding ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram, this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, which is a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
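As a rough illustration of the two-primitive grammar, the sketch below resolves a copy/gen program against its input text. The `resolve` function and its regular expression are hypothetical helpers written for this example. The line ranges are assumed to be 1-indexed and inclusive, which the abstract does not specify.

```python
import re

# Surface syntax taken from the abstract: <copy lines="i-j"/> and <gen>...</gen>.
_PRIMITIVE = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)


def resolve(program: str, source: str) -> str:
    """Expand a copy/gen program into edited text (hypothetical helper).

    Assumption: line ranges are 1-indexed and inclusive.
    """
    lines = source.splitlines(keepends=True)
    out = []
    pos = 0
    for match in _PRIMITIVE.finditer(program):
        # Reject stray text between primitives.
        if program[pos:match.start()].strip():
            raise ValueError("text outside a primitive")
        pos = match.end()
        if match.group(1) is not None:
            i, j = int(match.group(1)), int(match.group(2))
            if not (1 <= i <= j <= len(lines)):
                raise ValueError(f"copy range {i}-{j} is out of bounds")
            out.extend(lines[i - 1:j])  # verbatim reuse of input lines
        else:
            out.append(match.group(3))  # newly generated content
    if program[pos:].strip():
        raise ValueError("trailing text outside a primitive")
    return "".join(out)


source = "def add(a, b):\n    return a - b\n\nprint(add(1, 2))\n"
program = '<copy lines="1-1"/><gen>    return a + b\n</gen><copy lines="3-4"/>'
edited = resolve(program, source)

# Shifting one span boundary by a single line changes the output.
shifted = resolve(program.replace('"3-4"', '"2-4"'), source)
```

Here `edited` reproduces the source with only the second line rewritten, while `shifted` copies the buggy line back in. That sensitivity to a one-line boundary error is consistent with the steep exact-match drop the abstract reports under off-by-one noise.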

Problem

Research questions and friction points this paper is trying to address.

LLM editing

copy mechanism

parallel prefill

grammar-constrained decoding

KV cache efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Copy-as-Decode

parallel prefill

grammar-constrained decoding