🤖 AI Summary
This work addresses the high computational cost of Self-Consistency (SC) in large language model inference, which stems from fully decoding every sampled reasoning path. To mitigate this, we propose PoLR, a plug-and-play pre-filtering method that requires no fine-tuning. PoLR clusters short prefixes of the sampled reasoning paths, identifies the dominant cluster, and expands only the paths in that cluster. The approach is grounded in a theoretical analysis integrating prefix consistency, mutual information, and entropy, and it composes with adaptive inference mechanisms such as Early-Stopping SC. Evaluated on benchmarks including GSM8K and MATH500, PoLR matches or even surpasses the accuracy of standard Self-Consistency while reducing token consumption by up to 60% and latency by up to 50%.
📝 Abstract
Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive because they fully decode every sampled reasoning trace. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only the paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
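The cluster-then-filter loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function names (`polr_prefilter`, `normalize`) are invented here, and real prefix clustering would presumably use a semantic or learned similarity rather than the crude token-key grouping used in this toy.

```python
from collections import Counter

def normalize(prefix: str) -> str:
    # Crude cluster key: lowercase the prefix and keep its first few tokens.
    # A real system would likely embed prefixes and cluster by similarity.
    return " ".join(prefix.lower().split()[:8])

def polr_prefilter(prefixes: list[str]) -> list[int]:
    """Return indices of the dominant prefix cluster (hypothetical sketch).

    Paths outside the dominant cluster are dropped before full decoding,
    which is where the token/latency savings would come from.
    """
    keys = [normalize(p) for p in prefixes]
    dominant_key, _ = Counter(keys).most_common(1)[0]
    return [i for i, k in enumerate(keys) if k == dominant_key]

# Toy example: 5 sampled short prefixes; 3 share the same early reasoning.
prefixes = [
    "First compute 12 * 4 = 48",
    "First compute 12 * 4 = 48",
    "Let x be the unknown quantity",
    "First compute 12 * 4 = 48",
    "We can try small cases",
]
keep = polr_prefilter(prefixes)  # indices of the dominant cluster
# Only these paths would then be expanded into full reasoning traces
# and passed to the usual SC majority vote.
```

Under this sketch, only 3 of the 5 sampled paths are decoded to completion, so roughly 40% of the continuation tokens are never generated in this toy case.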