Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

📅 2025-12-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In long-chain reasoning, conventional token-level speculative decoding suffers frequent rejections and redundant regeneration because semantically equivalent steps need not match token for token; existing step-level methods still regenerate rejected steps, yielding limited efficiency gains. This paper proposes step-level dynamic-routing speculative decoding: a lightweight router, guided by the relative confidence advantage between draft and target model outputs at each reasoning step, decides whether to directly reuse, rectify, or regenerate the step, eliminating fine-grained token-by-token comparison and enabling efficient, semantically aligned verification. On mathematical reasoning tasks, the method achieves up to a 2× inference speedup with no accuracy degradation while significantly reducing target-model computational overhead.
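The reuse/rectify/regenerate decision described above can be sketched as a thresholded comparison of per-step confidences. This is a minimal illustration only: the function name, margin values, and the assumption that confidences are scalar scores are all ours, not details from the paper.

```python
# Hypothetical sketch of advantage-aware step routing.
# `reuse_margin` and `rectify_margin` are illustrative thresholds, not paper values.
def route_step(draft_conf: float, target_conf: float,
               reuse_margin: float = 0.05, rectify_margin: float = 0.25) -> str:
    """Choose an action for one reasoning step from the target model's
    relative confidence advantage over the draft model."""
    advantage = target_conf - draft_conf
    if advantage <= reuse_margin:
        return "reuse"        # draft step is good enough: accept as-is, no target compute
    if advantage <= rectify_margin:
        return "rectify"      # lightly correct the draft step with the target model
    return "regenerate"       # target is meaningfully better: regenerate the whole step
```

In the paper, this decision is made by a trained router rather than fixed thresholds; the sketch only shows the shape of the three-way choice.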

📝 Abstract
Modern Large Language Models achieve impressive reasoning capabilities with long Chains of Thought, but they incur substantial computational cost during inference, motivating techniques that improve the performance-cost ratio. Among these, Speculative Decoding accelerates inference by employing a fast but less accurate draft model to autoregressively propose tokens, which a more capable target model then verifies in parallel. However, traditional token-level Speculative Decoding struggles on reasoning tasks due to unnecessary rejections caused by token mismatches in semantically equivalent steps. Recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, but existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim 2\times$ at matched accuracy.
Problem

Research questions and friction points this paper is trying to address.

Improves reasoning efficiency in Large Language Models
Reduces unnecessary step regeneration in speculative decoding
Optimizes accuracy-cost trade-off with dynamic routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic routing based on model advantage prediction
Step-level speculative generation with lightweight router
Near-optimal efficiency-accuracy trade-offs in reasoning tasks
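The innovations above can be tied together in a sketch of the overall generation loop: the draft model proposes each step, and a lightweight router decides whether to spend target-model compute on it. Every name here (`draft_model`, `target_model`, `router`, and their methods) is an illustrative assumption about the interface, not the paper's actual implementation.

```python
# Hypothetical end-to-end loop for router-driven step-level speculation.
def arbitrage_decode(prompt, draft_model, target_model, router, max_steps=32):
    """Generate reasoning steps, reusing draft steps unless the router
    predicts the target model would produce a meaningfully better one."""
    steps = []
    context = prompt
    for _ in range(max_steps):
        draft_step = draft_model.propose_step(context)
        if draft_step is None:          # draft signals end of the reasoning chain
            break
        if router.predicts_target_advantage(context, draft_step):
            step = target_model.generate_step(context)   # spend target compute
        else:
            step = draft_step                            # reuse the draft step
        steps.append(step)
        context += step
    return steps
```

Note that unlike classic token-level speculative decoding, no token-by-token comparison occurs: the router's per-step prediction replaces exact-match verification entirely.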