SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

📅 2026-02-10
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing benchmarks for speculative decoding (SD) suffer from insufficient task diversity, a lack of throughput-oriented evaluation, and reliance on high-level implementations that fail to reflect real-world deployment scenarios. To address these limitations, this work proposes SPEED-Bench, the first unified evaluation framework that jointly accounts for semantic diversity and multi-concurrency workloads. SPEED-Bench integrates production-grade inference engines such as vLLM and TensorRT-LLM and introduces two complementary evaluation datasets, one qualitative and one throughput-focused, to span practical serving conditions from low-latency to high-throughput regimes. The benchmark uncovers critical issues including throughput overestimation with synthetic inputs, biases induced by low-diversity data, and limitations of vocabulary pruning. It further quantifies the impact of various real-world factors on SD performance and has been open-sourced to advance standardized, practical evaluation of SD algorithms.
📝 Abstract
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size-dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
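To make "speedup evaluation across a range of concurrencies" concrete, here is a minimal Python sketch of how such a comparison could be tabulated. It is not SPEED-Bench's actual code or API: the `RunResult` record, both helper functions, and every number in the usage lines are hypothetical and chosen only for illustration. The sketch assumes that speedup is the ratio of SD throughput to baseline throughput at the same concurrency, and that acceptance length is the average number of tokens emitted per target-model verification step.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One benchmark run at a fixed concurrency (hypothetical record type)."""
    concurrency: int      # number of simultaneous requests
    output_tokens: int    # total tokens generated across all requests
    wall_time_s: float    # end-to-end wall-clock time in seconds

    @property
    def throughput(self) -> float:
        # Output tokens per second across the whole batch.
        return self.output_tokens / self.wall_time_s


def speedup_by_concurrency(baseline, speculative):
    """Return {concurrency: SD throughput / baseline throughput}.

    Only concurrency levels present in both lists are compared.
    """
    base = {r.concurrency: r.throughput for r in baseline}
    return {
        r.concurrency: r.throughput / base[r.concurrency]
        for r in speculative
        if r.concurrency in base
    }


def mean_acceptance_length(tokens_per_step):
    """Average number of tokens emitted per target-model verification step."""
    return sum(tokens_per_step) / len(tokens_per_step)


# Placeholder measurements, NOT results from the paper.
baseline = [RunResult(1, 1000, 10.0), RunResult(32, 32000, 40.0)]
speculative = [RunResult(1, 1000, 5.0), RunResult(32, 32000, 32.0)]
speedups = speedup_by_concurrency(baseline, speculative)
```

Keeping one speedup value per concurrency level, rather than a single aggregate, matches the abstract's point that behavior can differ between low-batch and high-load regimes.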
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Benchmark
LLM Inference
Throughput Evaluation
Task Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Benchmark
Throughput Evaluation
Semantic Diversity
Production Integration