A1: Asynchronous Test-Time Scaling via Conformal Prediction

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high synchronization overhead, severe memory bottlenecks, and substantial latency of long-chain reasoning in large language models (LLMs) under speculative decoding, this paper proposes A1, the first framework to integrate conformal prediction with asynchronous test-time scaling. Methodologically, A1 introduces an online calibration mechanism and a three-stage rejection sampling pipeline that together enable statistically reliable, low-overhead asynchronous inference scheduling; it supports both sequential and parallel scaling, breaking free of the conventional synchronous paradigm. By optimizing for high arithmetic intensity and dynamically controlling confidence, A1 achieves up to a 56.7x speedup and a 4.14x throughput improvement across multiple mathematical reasoning benchmarks, while substantially reducing latency and GPU memory consumption without compromising generation accuracy.
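The summary's "online calibration mechanism" rests on conformal prediction, whose core move is simple: maintain a running set of nonconformity scores and accept a new candidate only if its score falls below the conformal quantile, which bounds the miss rate at a chosen level under exchangeability. The sketch below illustrates that quantile rule in isolation; the class name, the sliding window, and the choice of score are assumptions for illustration, not A1's actual API.

```python
import math

class OnlineConformalCalibrator:
    """Minimal sketch of online split-conformal calibration.

    Scores are nonconformity values (higher = more surprising). The
    threshold is the conformal quantile: a fresh exchangeable score
    exceeds it with probability at most `alpha`. Names and the
    sliding-window choice are illustrative, not taken from A1.
    """

    def __init__(self, alpha: float = 0.1, window: int = 512):
        self.alpha = alpha
        self.window = window          # keep only the most recent scores
        self.scores: list[float] = []

    def update(self, score: float) -> None:
        # Record a calibration score, discarding the oldest beyond the window.
        self.scores.append(score)
        if len(self.scores) > self.window:
            self.scores.pop(0)

    def threshold(self) -> float:
        n = len(self.scores)
        if n == 0:
            return float("inf")       # accept everything until calibrated
        # Conformal quantile rank: ceil((n + 1) * (1 - alpha)), capped at n.
        rank = min(n, math.ceil((n + 1) * (1 - self.alpha)))
        return sorted(self.scores)[rank - 1]

    def accept(self, score: float) -> bool:
        # Accept iff the candidate is no more surprising than the threshold.
        return score <= self.threshold()
```

For example, with `alpha = 0.1` and 100 calibration scores, the rule picks the 91st smallest score as the threshold, so at most roughly 10% of fresh scores land above it.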

📝 Abstract
Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and incurring no accuracy loss compared to using target-model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
Problem

Research questions and friction points this paper is trying to address.

Reduces synchronization overhead in test-time scaling
Addresses memory bottlenecks and latency issues
Enables asynchronous inference with statistical guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous inference framework
Online calibration strategy
Three-stage rejection sampling pipeline
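The paper does not spell out its three stages here, but a staged accept/reject cascade for speculative decoding can be sketched as follows: a cheap filter on the draft model's own confidence, a conformal gate using a calibrated threshold, and the standard speculative-sampling coin flip. Every stage boundary, name, and the nonconformity score below are assumptions for illustration, not A1's actual pipeline.

```python
def three_stage_accept(
    p_draft: float,          # draft model's probability of the proposed token
    p_target: float,         # target model's probability of the same token
    conf_threshold: float,   # calibrated conformal threshold on nonconformity
    u: float,                # a Uniform(0, 1) sample, passed in for determinism
    min_draft_prob: float = 1e-4,
) -> bool:
    """Illustrative three-stage cascade (hypothetical stage design).

    Stage 1: cheap filter on draft confidence.
    Stage 2: conformal gate; here nonconformity = 1 - p_target.
    Stage 3: standard speculative-sampling rule u < p_target / p_draft.
    """
    # Stage 1: discard tokens the draft itself barely believes in.
    if p_draft < min_draft_prob:
        return False
    # Stage 2: reject if the target finds the token too surprising.
    if 1.0 - p_target > conf_threshold:
        return False
    # Stage 3: exact speculative-sampling acceptance coin flip.
    return u < min(1.0, p_target / p_draft)
```

Ordering the stages from cheapest to most expensive lets most rejections happen before the target model's probability is even needed, which is where an asynchronous scheduler would recover its latency savings.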