TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding

πŸ“… 2025-11-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In dynamic speculative decoding, the number of candidate tokens is typically configured manually, leading to poor generalization across models and tasks. Method: This paper proposes TapOut, a parameter-free algorithm that uses a multi-armed bandit (MAB) framework for online, adaptive, training-free selection of a speculation policy, which in turn determines the candidate sequence length. TapOut treats lightweight, parameter-free speculative strategies as β€œarms” and dynamically balances exploration and exploitation based on real-time decoding rewards, requiring no hyperparameter tuning. Contribution/Results: Extensive experiments across diverse large language models (e.g., Llama, Qwen) and benchmark datasets demonstrate that TapOut is plug-and-play: it achieves state-of-the-art speedup while preserving generation quality, and significantly improves robustness and generalization across architectures and tasks.
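The strategies-as-arms idea from the summary can be sketched with a standard UCB1 bandit. This is a minimal, hedged illustration, not TapOut's actual formulation: the strategy names, the Bernoulli acceptance rewards, and the UCB1 index are all assumptions standing in for the paper's reward signal and meta-algorithm.

```python
import math
import random

# Hedged sketch: a UCB1 meta-selector over parameter-free speculation
# strategies ("arms"). Names and reward probabilities are illustrative.
STRATEGIES = ["fixed_short", "fixed_long", "entropy_stop", "confidence_stop"]

counts = {s: 0 for s in STRATEGIES}   # times each arm was pulled
totals = {s: 0.0 for s in STRATEGIES}  # cumulative reward per arm

def select_strategy(step):
    # Pull each arm once, then maximize the UCB1 index:
    # empirical mean + sqrt(2 ln t / n_arm).
    for s in STRATEGIES:
        if counts[s] == 0:
            return s
    return max(
        STRATEGIES,
        key=lambda s: totals[s] / counts[s]
                      + math.sqrt(2.0 * math.log(step) / counts[s]),
    )

def update(strategy, reward):
    counts[strategy] += 1
    totals[strategy] += reward

# Simulated decoding: reward = 1 if the drafted tokens were largely
# accepted, 0 otherwise, with a made-up acceptance probability per arm.
random.seed(0)
accept_prob = {"fixed_short": 0.1, "fixed_long": 0.2,
               "entropy_stop": 0.9, "confidence_stop": 0.3}
for step in range(1, 5001):
    arm = select_strategy(step)
    update(arm, 1.0 if random.random() < accept_prob[arm] else 0.0)

best = max(STRATEGIES, key=lambda s: counts[s])
print(best, counts[best])
```

With these toy reward rates, pulls concentrate on the arm with the highest acceptance reward while the exploration term keeps the other arms occasionally sampled, mirroring the exploration/exploitation balance the summary describes.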


πŸ“ Abstract
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
Problem

Research questions and friction points this paper is trying to address.

Determining optimal token draft count for speculative decoding acceleration
Replacing hand-tuned thresholds with automated bandit-based policy selection
Achieving model-agnostic speedup without hyperparameter tuning requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit-based dynamic speculation policy selection
Training-free plug-and-play algorithm for speculative decoding
Parameter-free strategy selection using reward exploration