Faster and Better LLMs via Latency-Aware Test-Time Scaling

📅 2025-05-26
🤖 AI Summary
Existing research on test-time scaling (TTS) prioritizes compute-optimality while overlooking actual inference latency in latency-sensitive scenarios, yielding strategies that are computationally efficient but do not necessarily minimize end-to-end delay. This work is the first to systematically demonstrate that "compute-optimal TTS ≠ latency-optimal TTS" and proposes a latency-aware TTS paradigm. The approach combines branch-wise parallelism (multiple concurrent inference branches) with sequence-wise parallelism (enabled by speculative decoding), and allocates computational resources between the two. Evaluated on the MATH-500 benchmark, the method achieves 82.3% accuracy within one minute with a 32B model and 72.4% within ten seconds with a 3B model, substantially outperforming conventional TTS methods. This establishes a practical, end-to-end optimized framework for latency-critical large language model inference, enabling efficient co-optimization of computation and latency.

📝 Abstract
Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
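The first of the two approaches, branch-wise parallelism, can be illustrated with a minimal sketch. This is not the paper's implementation: `generate_answer` is a stub standing in for a real LLM call, and majority voting is used as one common aggregation choice (the paper's actual sampler and verifier are not shown). The latency point is that concurrent branches cost roughly the wall-clock time of the slowest single branch, not the sum of all branches.

```python
# Illustrative sketch of branch-wise parallelism with majority voting.
# NOTE: generate_answer is a hypothetical stub, not a real model call.
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate_answer(prompt: str, seed: int) -> str:
    """Stub for one inference branch; a real system would query the LLM."""
    rng = random.Random(seed)
    # Pretend most branches converge on the correct answer.
    return "42" if rng.random() < 0.9 else str(rng.randint(0, 9))

def branch_parallel_answer(prompt: str, num_branches: int = 8) -> str:
    # Launch all branches concurrently: end-to-end latency is roughly
    # one branch's generation time, not num_branches times that.
    with ThreadPoolExecutor(max_workers=num_branches) as pool:
        answers = list(pool.map(lambda s: generate_answer(prompt, s),
                                range(num_branches)))
    # Aggregate by majority vote (best-of-N with a verifier is another
    # common aggregation strategy).
    return Counter(answers).most_common(1)[0][0]

print(branch_parallel_answer("What is 6 * 7?"))
```

Because each branch seeds its own random generator, the result is deterministic regardless of thread scheduling.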
Problem

Research questions and friction points this paper is trying to address.

Optimizing Test-Time Scaling for lower latency in LLMs
Balancing compute-optimal TTS with latency-sensitive requirements
Enhancing LLM speed and accuracy via parallel inference approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branch-wise parallelism for concurrent inference
Sequence-wise parallelism via speculative decoding
Latency-optimal resource allocation in TTS
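The sequence-wise parallelism above rests on speculative decoding: a cheap draft model proposes several tokens ahead, and the target model verifies them in parallel, keeping the longest agreed prefix. The toy sketch below shows only that draft-then-verify loop; `draft_next` and `target_next` are stand-ins for small and large LMs, and the paper's actual models and acceptance rule are not shown.

```python
# Toy sketch of the draft-then-verify loop behind speculative decoding.
# NOTE: draft_next and target_next are hypothetical stubs over a two-token
# "alphabet", not real language models.
def draft_next(ctx):
    # Naive draft model: always guesses "a" (cheap but often wrong).
    return "a"

def target_next(ctx):
    # Target model: the ground truth we must match ("abab..." pattern).
    return "b" if len(ctx) % 2 == 1 else "a"

def speculative_decode(n_tokens, k=4):
    out = []
    while len(out) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Target model checks each proposed position (in a real system
        #    this is one parallel forward pass) and keeps the longest
        #    prefix it agrees with, plus one corrected token.
        accepted = []
        for tok in draft:
            t = target_next(out + accepted)
            accepted.append(t)
            if t != tok:
                break  # first disagreement: discard the rest of the draft
        out.extend(accepted)
    return "".join(out[:n_tokens])

print(speculative_decode(8))
```

Here the draft is wrong every other token, so each verify pass accepts two tokens; with a better draft model, longer prefixes are accepted and more target-model steps are skipped per pass, which is where the latency savings come from.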