TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing batched speculative decoding methods either optimize per-request or apply uniform strategies across the entire batch, resulting in low draft token acceptance rates, high verification overhead, and suboptimal resource utilization. This paper introduces TETRIS—the first framework to jointly optimize draft token selection *per request* at the batch level. TETRIS employs probabilistic modeling and a greedy algorithm to dynamically select, for each request, the draft tokens most likely to be accepted, under a theoretically derived upper bound on acceptance rate and a parallel verification mechanism. Its key innovation lies in balancing per-request acceptance probability with global computational efficiency, overcoming fundamental limitations of prior approaches. We provide theoretical guarantees on optimality under the proposed constraints. Experiments across multiple LLMs demonstrate significant improvements in both acceptance rate and throughput—particularly beneficial in inference-resource-constrained deployment scenarios.
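The core idea described above, greedily selecting, across all requests in a batch, the draft tokens most likely to be accepted under a shared verification budget, can be sketched roughly as follows. This is an illustrative approximation, not the paper's actual algorithm: the function name, the per-token acceptance probabilities, and the prefix-product scoring are all assumptions made for the example.

```python
import heapq

def select_draft_tokens(accept_probs, budget):
    """Greedy batch-level draft token selection (illustrative sketch).

    accept_probs: one list per request, giving estimated acceptance
        probabilities of that request's candidate draft tokens in
        draft order. A draft token can only be verified if all earlier
        draft tokens of the same request are also selected, so each
        request contributes a prefix of its draft sequence.
    budget: total number of draft tokens the verifier can check
        in one parallel verification pass.

    Returns the number of draft tokens to keep for each request.
    """
    kept = [0] * len(accept_probs)
    # Max-heap keyed on the cumulative acceptance probability of the
    # next candidate token in each request (enforces the prefix rule).
    heap = []
    for i, probs in enumerate(accept_probs):
        if probs:
            heapq.heappush(heap, (-probs[0], i, probs[0]))
    while heap and budget > 0:
        _, i, cum = heapq.heappop(heap)
        kept[i] += 1
        budget -= 1
        nxt = kept[i]
        if nxt < len(accept_probs[i]):
            new_cum = cum * accept_probs[i][nxt]
            heapq.heappush(heap, (-new_cum, i, new_cum))
    return kept
```

With a budget of 3 and three requests whose draft tokens have acceptance probabilities `[0.9, 0.8]`, `[0.5]`, and `[0.95, 0.9, 0.1]`, the sketch allocates one token to the first request and two to the third, skipping the low-probability draft entirely.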

📝 Abstract
We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or for a group of requests as a whole, TETRIS actively selects, for every request in a batch, the most promising draft tokens to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted compute. Such effective resource utilization, enabling fast inference in large language models (LLMs), is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.
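The abstract's notion of "rejected tokens as wasted compute" comes from how standard speculative decoding verifies drafts: each draft token is accepted with probability min(1, p/q), where q is the draft model's probability for the token and p the target model's, and everything after the first rejection is discarded. A minimal sketch of this standard verification step (not TETRIS itself; the function name and interface are assumptions for illustration):

```python
import random

def verify_drafts(draft_tokens, q_probs, p_probs, rng=random.random):
    """Standard speculative-decoding verification for one request.

    Each draft token is accepted with probability min(1, p/q), where
    q_probs[i] is the draft model's probability for draft_tokens[i]
    and p_probs[i] is the target model's. Verification stops at the
    first rejection; every draft token after that point is wasted work.

    Returns (accepted_tokens, num_wasted).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        if rng() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # all remaining draft tokens are discarded
    return accepted, len(draft_tokens) - len(accepted)
```

Selecting drafts with high p/q ratios up front, as TETRIS aims to do per request, directly shrinks the `num_wasted` term across the batch.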
Problem

Research questions and friction points this paper is trying to address.

Suboptimal throughput in batch speculative decoding
Computing resources wasted on rejected draft tokens in LLMs
Low draft token acceptance rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly optimizes draft token selection per request at the batch level
Greedily selects the most promising draft tokens for parallel verification
Increases token acceptance rate and throughput