🤖 AI Summary
This work proposes Parallel-Probe, a training-free approach to parallel inference that addresses two weaknesses of existing methods: high computational cost and underuse of global dynamics across branches. By introducing a 2D probing mechanism that periodically elicits intermediate answers from all branches, the study uncovers three patterns: non-monotonic width-depth trade-offs, heterogeneous branch lengths, and early stabilization of global consensus. Building on these insights, the authors design an online controller that regulates inference depth through consensus-based early stopping and adjusts width through deviation-based branch pruning. Experiments across three benchmarks and multiple models show that Parallel-Probe achieves a better accuracy-cost trade-off than standard majority voting, reducing sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
📝 Abstract
Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
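The abstract does not specify how consensus or deviation is measured, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the 2D probe is available as a matrix of intermediate answers (rows are probe steps, columns are branches), treats consensus as the majority answer among active branches, stops once that majority has been unchanged for `patience` probes, and prunes a branch after `max_deviations` consecutive disagreements with the majority. The names `run_controller`, `patience`, and `max_deviations` are hypothetical.

```python
from collections import Counter


def majority(answers):
    """Return the most common non-None answer, or None if nobody has answered."""
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def run_controller(probes, patience=2, max_deviations=2):
    """Replay a probe matrix and apply early stopping and branch pruning.

    probes[t][b] is the intermediate answer of branch b at probe step t
    (None if the branch has no answer yet). `patience` and `max_deviations`
    are hypothetical knobs, not values taken from the paper.
    Returns (final_answer, stop_step, pruned_branches).
    """
    n_branches = len(probes[0])
    active = set(range(n_branches))
    deviations = [0] * n_branches
    pruned = []
    last_consensus, stable = None, 0

    for step, row in enumerate(probes):
        consensus = majority([row[b] for b in sorted(active)])

        # Depth control: count consecutive probes with the same consensus.
        if consensus is not None and consensus == last_consensus:
            stable += 1
        else:
            stable = 1 if consensus is not None else 0
        last_consensus = consensus
        if stable >= patience:
            return consensus, step, pruned

        # Width control: prune branches that keep disagreeing with the consensus.
        for b in sorted(active):
            disagrees = (
                consensus is not None
                and row[b] is not None
                and row[b] != consensus
            )
            deviations[b] = deviations[b] + 1 if disagrees else 0
            if deviations[b] >= max_deviations:
                active.discard(b)
                pruned.append(b)

    # No early stop: fall back to the last observed consensus.
    return last_consensus, len(probes) - 1, pruned


# Toy probe matrix: 4 probe steps, 4 branches (made-up data for illustration).
probes = [
    ["4", "7", "4", None],
    ["4", "7", "4", "4"],
    ["4", "9", "4", "4"],
    ["4", "9", "4", "4"],
]
result = run_controller(probes, patience=3, max_deviations=2)
print(result)
```

On this toy input the call returns `("4", 2, [1])`: branch 1 is pruned after two consecutive disagreements, and decoding stops at probe step 2 because the majority answer has held for three probes, so the final probe step is never spent. In a real online setting the pruned and early-stopped branches would simply stop generating tokens, which is where the savings in sequential and total tokens would come from.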