🤖 AI Summary
Existing benchmarks for speculative decoding (SD) suffer from insufficient task diversity, a lack of throughput-oriented evaluation, and reliance on high-level implementations that fail to reflect real-world deployment scenarios. To address these limitations, this work proposes SPEED-Bench, the first unified evaluation framework that jointly accounts for semantic diversity and multi-concurrency workloads. SPEED-Bench integrates production-grade inference engines such as vLLM and TensorRT-LLM and introduces two complementary data splits, a Qualitative split curated for semantic diversity and a Throughput split for evaluating speedup across concurrency levels, to span practical serving conditions from low-latency to high-throughput regimes. The benchmark uncovers critical issues including throughput overestimation with synthetic inputs, biases induced by low-diversity data, and limitations of vocabulary pruning. It further shows that the optimal draft length depends on batch size, and it has been open-sourced to advance standardized, practical evaluation of SD algorithms.
📝 Abstract
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size-dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
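To illustrate why SD performance is data-dependent, the sketch below computes the expected number of tokens emitted per target-model verification step from a per-token acceptance probability. It is not taken from SPEED-Bench: the function name `expected_tokens_per_step`, the workload labels, and the acceptance rates are hypothetical, and it assumes each drafted token is accepted independently with the same probability, which is the standard simplification in the SD literature rather than a property of real workloads.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step of the target model.

    Assumption (simplification): each of the `gamma` drafted tokens is
    accepted independently with probability `alpha`. Under that assumption
    the expectation is the geometric sum 1 + alpha + ... + alpha**gamma,
    i.e. (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the one token the target adds.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates for two illustrative workloads;
    # these numbers are placeholders, not measurements.
    workloads = {"predictable text": 0.9, "diverse text": 0.5}
    for name, alpha in workloads.items():
        for gamma in (1, 3, 7):
            tokens = expected_tokens_per_step(alpha, gamma)
            print(f"{name:>16} alpha={alpha} gamma={gamma}: {tokens:.3f} tokens/step")
```

Under this model, the same drafter and draft length yield very different tokens per step depending on the acceptance rate the workload induces, which is one reason a benchmark built from narrow or synthetic data can misstate speedup. The model ignores drafting cost and batching effects, so it says nothing on its own about end-to-end throughput.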