Constrained Decoding with Speculative Lookaheads

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing constrained decoding methods for large language models (LLMs) struggle to balance inference efficiency and constraint satisfaction: CDLH incurs high computational overhead due to token-wise lookahead, while greedy decoding suffers from low constraint adherence. Method: This paper introduces the first framework integrating speculative decoding into constrained decoding. It employs a lightweight draft model to generate candidate sequences, which are then jointly verified by the target LLM and a task-specific reward function, enabling constraint-aware, offloaded lookahead computation. Contribution/Results: Evaluated across two constrained generation tasks and three major LLM families, the approach achieves a 2.2×–12.15× speedup over baseline constrained decoding, with constraint satisfaction rates approaching those of CDLH and substantially outperforming greedy decoding. The core contribution is an efficient, general-purpose, and scalable paradigm for constrained decoding, marking the first application of speculative decoding to this domain.

📝 Abstract
Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient, but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without experiencing the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding, which uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model is used to generate lookaheads, which are verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL on two constrained decoding tasks with three LLM families and achieve a 2.2x to 12.15x speedup over CDLH without significant performance reduction.
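The abstract's core loop (draft model proposes a lookahead, target model and reward function verify it) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `draft_next`, `target_accepts`, and `reward` are hypothetical stand-ins for a small draft LLM, a large target LLM, and a task-specific reward function.

```python
# Toy sketch of one CDSL-style step (assumed interfaces, not the paper's code).
VOCAB = ["a", "b", "c", "<eos>"]

def draft_next(prefix):
    # Stand-in for a cheap draft LLM: a deterministic toy policy.
    return VOCAB[len(prefix) % len(VOCAB)]

def target_accepts(prefix, token):
    # Stand-in for target-LLM verification: toy rule rejecting "c".
    return token != "c"

def reward(seq):
    # Stand-in for a task-specific constraint reward: "contains 'b'".
    return 1.0 if "b" in seq else 0.0

def cdsl_step(prefix, k=4):
    """Draft up to k lookahead tokens, verify each with the target model,
    then keep the lookahead only if it improves the constraint reward."""
    lookahead = []
    for _ in range(k):
        tok = draft_next(prefix + lookahead)
        if not target_accepts(prefix + lookahead, tok):
            break  # rejected by the target model: stop drafting
        lookahead.append(tok)
    if reward(prefix + lookahead) > reward(prefix):
        return prefix + lookahead  # accept the whole verified lookahead
    return prefix + lookahead[:1]  # otherwise advance by one verified token
```

The efficiency gain in real CDSL comes from the draft model running many cheap forward passes per expensive target-model verification, so accepted lookaheads amortize the target model's cost across several tokens.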
Problem

Research questions and friction points this paper is trying to address.

CDLH's per-token lookahead roll-outs make it prohibitively expensive in practice
Greedy decoding is efficient but achieves very low constraint satisfaction
No existing method balances inference efficiency with constraint adherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative lookaheads accelerate constrained decoding
Lightweight draft LLM offloads lookahead computation
Target LLM plus task-specific reward functions verify drafts, preserving constraint satisfaction