🤖 AI Summary
This work addresses the high variability and long-tail latency of Monte Carlo Tree Search (MCTS) under test-time compute scaling, which stem from inefficient search trajectories and limit the effectiveness of existing optimizations when search progress stalls. The authors propose a negative early-exit mechanism that proactively prunes unproductive trajectories, together with an adaptive boosting strategy that dynamically reallocates the freed computational resources, thereby mitigating resource contention among parallel searches. Implemented within the vLLM inference framework, this approach significantly reduces end-to-end p99 latency and improves system throughput while preserving the accuracy of large language model inference.
📝 Abstract
Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice. Existing optimizations, such as positive early exit, reduce latency in favorable cases but are less effective when the search continues without meaningful progress. We introduce *negative early exit*, which prunes unproductive MCTS trajectories, and an *adaptive boosting mechanism* that reallocates the reclaimed computation to reduce resource contention among concurrent searches. Integrated into vLLM, these techniques substantially reduce p99 end-to-end latency while improving throughput and maintaining reasoning accuracy.
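The interplay of the two ideas can be sketched as a toy scheduling loop. This is not the paper's implementation: the per-iteration gain traces stand in for real MCTS value estimates, and `stall_window`, `budget_per_search`, and the shared budget pool are hypothetical simplifications of vLLM-level scheduling. A search whose value stops improving for `stall_window` consecutive iterations exits early (negative early exit), and its unused budget is handed to still-active searches (adaptive boosting):

```python
def run_searches(improvement_traces, budget_per_search, stall_window=2):
    """Toy model: improvement_traces maps a query name to its per-iteration
    value gains. Stalled searches are pruned and their leftover iteration
    budget is pooled for reuse by active searches."""
    active = {name: {"budget": budget_per_search, "value": 0.0,
                     "stall": 0, "step": 0}
              for name in improvement_traces}
    pool = 0        # reclaimed iterations available for boosting
    finished = {}   # name -> (final value, termination reason)
    while active:
        for name in list(active):
            s = active[name]
            if s["budget"] == 0:
                if pool > 0:
                    pool -= 1       # adaptive boosting: draw reclaimed budget
                    s["budget"] = 1
                else:
                    finished[name] = (s["value"], "budget exhausted")
                    del active[name]
                    continue
            trace = improvement_traces[name]
            gain = trace[s["step"]] if s["step"] < len(trace) else 0.0
            s["step"] += 1
            s["budget"] -= 1
            s["value"] += gain
            s["stall"] = 0 if gain > 0 else s["stall"] + 1
            if s["stall"] >= stall_window:
                pool += s["budget"]  # negative early exit: reclaim compute
                finished[name] = (s["value"], "pruned (negative early exit)")
                del active[name]
    return finished, pool

finished, pool = run_searches(
    {"A": [0.5, 0.0, 0.0, 0.0],          # stalls after one useful step
     "B": [0.2, 0.3, 0.1, 0.2, 0.1]},    # keeps making progress
    budget_per_search=4)
```

In this run, search A is pruned after two zero-gain iterations, and the iteration it never used lets search B run one step beyond its original budget. The real system applies the same principle to concurrent MCTS requests inside the inference engine, where a reclaimed trajectory frees KV-cache and batch slots rather than loop iterations.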