Faster and Better LLMs via Latency-Aware Test-Time Scaling

📅 2025-05-26
🤖 AI Summary
Existing research on test-time scaling (TTS) prioritizes compute-optimality while overlooking actual inference latency in latency-sensitive scenarios, yielding strategies that are computationally efficient but do not necessarily minimize end-to-end delay. This work is the first to systematically demonstrate that "compute-optimal TTS ≠ latency-optimal TTS" and proposes a latency-aware TTS paradigm. The approach combines branch-wise parallelism (multiple concurrent inference branches) with sequence-wise parallelism (enabled by speculative decoding), and allocates computational resources between the two. Evaluated on the MATH-500 benchmark, the method achieves 82.3% accuracy within one minute with a 32B model and 72.4% within ten seconds with a 3B model, substantially outperforming conventional TTS methods. This establishes a practical, end-to-end optimized framework for latency-critical large language model inference, enabling efficient co-optimization of computation and latency.

📝 Abstract
Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
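The first of the two approaches, branch-wise parallelism, can be illustrated with a minimal sketch. This is not the paper's implementation: `generate_answer` is a stub standing in for a real LLM call, and majority voting is used as one common aggregation choice (the paper's actual sampler and verifier are not shown). The latency point is that concurrent branches cost roughly the wall-clock time of the slowest single branch, not the sum of all branches.

```python
# Illustrative sketch of branch-wise parallelism with majority voting.
# NOTE: generate_answer is a hypothetical stub, not a real model call.
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate_answer(prompt: str, seed: int) -> str:
    """Stub for one inference branch; a real system would query the LLM."""
    rng = random.Random(seed)
    # Pretend most branches converge on the correct answer.
    return "42" if rng.random() < 0.9 else str(rng.randint(0, 9))

def branch_parallel_answer(prompt: str, num_branches: int = 8) -> str:
    # Launch all branches concurrently: end-to-end latency is roughly
    # one branch's generation time, not num_branches times that.
    with ThreadPoolExecutor(max_workers=num_branches) as pool:
        answers = list(pool.map(lambda s: generate_answer(prompt, s),
                                range(num_branches)))
    # Aggregate by majority vote (best-of-N with a verifier is another
    # common aggregation strategy).
    return Counter(answers).most_common(1)[0][0]

print(branch_parallel_answer("What is 6 * 7?"))
```

Because each branch seeds its own random generator, the result is deterministic regardless of thread scheduling.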
Problem

Research questions and friction points this paper is trying to address.

Optimizing Test-Time Scaling for lower latency in LLMs
Balancing compute-optimal TTS with latency-sensitive requirements
Enhancing LLM speed and accuracy via parallel inference approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Branch-wise parallelism for concurrent inference
Sequence-wise parallelism via speculative decoding
Latency-optimal resource allocation in TTS
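The sequence-wise parallelism above rests on speculative decoding: a cheap draft model proposes several tokens ahead, and the target model verifies them in parallel, keeping the longest agreed prefix. The toy sketch below shows only that draft-then-verify loop; `draft_next` and `target_next` are stand-ins for small and large LMs, and the paper's actual models and acceptance rule are not shown.

```python
# Toy sketch of the draft-then-verify loop behind speculative decoding.
# NOTE: draft_next and target_next are hypothetical stubs over a two-token
# "alphabet", not real language models.
def draft_next(ctx):
    # Naive draft model: always guesses "a" (cheap but often wrong).
    return "a"

def target_next(ctx):
    # Target model: the ground truth we must match ("abab..." pattern).
    return "b" if len(ctx) % 2 == 1 else "a"

def speculative_decode(n_tokens, k=4):
    out = []
    while len(out) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Target model checks each proposed position (in a real system
        #    this is one parallel forward pass) and keeps the longest
        #    prefix it agrees with, plus one corrected token.
        accepted = []
        for tok in draft:
            t = target_next(out + accepted)
            accepted.append(t)
            if t != tok:
                break  # first disagreement: discard the rest of the draft
        out.extend(accepted)
    return "".join(out[:n_tokens])

print(speculative_decode(8))
```

Here the draft is wrong every other token, so each verify pass accepts two tokens; with a better draft model, longer prefixes are accepted and more target-model steps are skipped per pass, which is where the latency savings come from.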