🤖 AI Summary
To address the high computational cost and low efficiency of test-time scaling methods (e.g., best-of-N, tree search) for large language models, this paper proposes STAND, the first model-agnostic, stochastic adaptive N-gram speculative decoding framework. Its core contributions are: (1) a stochastic draft generation strategy requiring no auxiliary models; (2) a logit-level N-gram memory module that preserves probabilistic structure and enables efficient reuse; and (3) a hybrid approach combining Gumbel-Top-K sampling with data-driven tree construction to enhance draft quality and acceptance rate. Evaluated on AIME-2024, GPQA-Diamond, and LiveCodeBench, STAND reduces inference latency by 60-65% and improves throughput by 14-28% over state-of-the-art methods. In single-trajectory settings, latency drops by 48-58%. STAND requires zero training and is fully plug-and-play.
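The Gumbel-Top-K sampling mentioned in contribution (3) is a standard trick for drawing K distinct tokens in proportion to their softmax probabilities: perturb each logit with i.i.d. Gumbel(0, 1) noise and keep the K largest. The sketch below is illustrative, not the paper's implementation; the function name and shapes are assumptions.

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices without replacement, with marginal
    probabilities governed by softmax(logits), via the Gumbel-Top-K trick."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))
    perturbed = np.asarray(logits, dtype=float) + gumbel
    # indices of the k largest perturbed logits, in descending order
    return list(np.argsort(perturbed)[::-1][:k])
```

Because the noise is added per token, repeated calls yield diverse draft candidates while still favoring high-probability continuations, which is what makes it attractive for stochastic drafting.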
📄 Abstract
Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
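The core idea of model-free N-gram drafting can be sketched as follows: record (context, next-token) statistics from text the model has already generated, look up the current suffix to propose several draft tokens at once, and accept the longest prefix that agrees with the target model's own choices. This is a minimal bigram sketch under assumed names (`NGramDrafter`, `observe`, `draft`, `accept`), not the paper's logit-level module or tree construction.

```python
from collections import defaultdict

class NGramDrafter:
    """Toy model-free drafter: an N-gram memory built from previously
    generated tokens, used to propose speculative continuations."""

    def __init__(self, n=2):
        self.n = n
        # maps (n-1)-token context -> {next_token: count}
        self.memory = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        """Record (context, next-token) counts from a generated sequence."""
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i : i + self.n - 1])
            self.memory[ctx][tokens[i + self.n - 1]] += 1

    def draft(self, context, max_len=4):
        """Propose up to max_len tokens by repeated greedy memory lookup."""
        out = []
        ctx = tuple(context[-(self.n - 1):])
        for _ in range(max_len):
            nxt = self.memory.get(ctx)
            if not nxt:
                break  # unseen context: stop drafting
            tok = max(nxt, key=nxt.get)
            out.append(tok)
            ctx = (ctx + (tok,))[-(self.n - 1):]
        return out

    @staticmethod
    def accept(draft, target_tokens):
        """Speculative verification: keep the longest prefix of the draft
        that matches the target model's tokens for the same positions."""
        k = 0
        while k < min(len(draft), len(target_tokens)) and draft[k] == target_tokens[k]:
            k += 1
        return draft[:k]
```

Because reasoning trajectories repeat phrases and sub-derivations, such lookups hit often, and every accepted draft token saves one sequential forward pass of the target model.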