🤖 AI Summary
This work addresses the inefficiency of Tree-of-Thought (ToT) reasoning, whose reward-guided exploration creates a synchronization bottleneck that hinders parallelization. The authors propose SPEX, the first framework to systematically optimize ToT inference efficiency. It combines intra-query speculative path selection, inter-query dynamic budget allocation, and adaptive early termination to increase parallelism and reduce latency. SPEX is implemented on top of SGLang and composes with token-level speculative decoding. Experiments show that SPEX accelerates diverse ToT algorithms and large language models by 1.2–3.0×, and by up to 4.1× end to end when combined with token-level speculative decoding.
📝 Abstract
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, designed mainly for linear Chain-of-Thought (CoT) reasoning, do not address this barrier, leaving the efficiency of ToT underexplored.
To improve ToT reasoning efficiency, we observe that reasoning paths can be explored speculatively to break the reward synchronization barrier. Building on this observation, we propose SPEX, which introduces three key techniques: (i) intra-query speculative path selection, which predicts and expands high-potential branches of the tree; (ii) inter-query budget allocation, which dynamically balances speculative resources across queries; and (iii) adaptive early termination, which prunes deep and redundant branches in skewed search trees.
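The intuition behind speculative path selection can be illustrated with a toy latency model. The sketch below is not the SPEX algorithm: the cost model, the cheap per-candidate guess, and all names (`Candidate`, `sequential_latency`, `speculative_latency`) are assumptions made for illustration. It assumes that, while the reward model scores one level of the tree, the system expands the branches a cheap predictor ranks highest, and pays one extra generation step only when that prediction misses a branch the true reward selects.

```python
from typing import List, Set, Tuple

# Hypothetical candidate record: (name, cheap_guess, true_reward).
Candidate = Tuple[str, float, float]


def top_k_names(cands: List[Candidate], key_index: int, k: int) -> Set[str]:
    """Names of the k candidates with the highest value at key_index."""
    ranked = sorted(cands, key=lambda c: c[key_index], reverse=True)
    return {c[0] for c in ranked[:k]}


def sequential_latency(levels: List[List[Candidate]],
                       t_gen: float, t_reward: float) -> float:
    """Reward-synchronized search: every level generates, then waits for rewards."""
    return len(levels) * (t_gen + t_reward)


def speculative_latency(levels: List[List[Candidate]], k: int,
                        t_gen: float, t_reward: float) -> float:
    """Toy model: next-level generation overlaps with reward scoring."""
    if not levels:
        return 0.0
    total = t_gen  # the first level must be generated before anything is scored
    for cands in levels[:-1]:
        predicted = top_k_names(cands, 1, k)  # branches expanded speculatively
        actual = top_k_names(cands, 2, k)     # branches the reward model selects
        total += max(t_gen, t_reward)         # scoring and speculation overlap
        if not actual <= predicted:           # misprediction: expand missed branches
            total += t_gen
    total += t_reward  # the last level only needs to be scored
    return total


# Three levels with beam width k = 2; the guess is wrong at the second level.
levels = [
    [("a", 0.9, 0.8), ("b", 0.2, 0.1), ("c", 0.6, 0.7)],
    [("d", 0.5, 0.2), ("e", 0.4, 0.9), ("f", 0.1, 0.3)],
    [("g", 0.3, 0.3), ("h", 0.7, 0.6)],
]
baseline = sequential_latency(levels, t_gen=1.0, t_reward=1.0)
speculative = speculative_latency(levels, k=2, t_gen=1.0, t_reward=1.0)
```

In this model the benefit shrinks as the predictor misses more often, which is one way to see why the speculative budget per query matters.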
We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves a $1.2\times$ to $3\times$ speedup across different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contribution of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling of LLMs.