🤖 AI Summary
Large language models (LLMs) improve reasoning accuracy with Chain-of-Thought (CoT) prompting and its extension, Long CoT, but the lengthy intermediate reasoning traces they generate incur excessive token consumption and high latency, hindering practical deployment. This paper proposes Fractured Sampling, an inference-time sampling framework that dynamically truncates reasoning traces and draws multiple trajectories and multiple solutions per trajectory. The authors first show, counterintuitively, that truncated CoT traces often match or even surpass the accuracy of full-length CoT. Fractured Sampling then jointly optimizes three orthogonal controls: reasoning depth (where to truncate), the number of parallel trajectories, and the number of solutions per trajectory, combining on-the-fly trace truncation, parallel multi-path generation, and Pass@k-driven compute allocation. Evaluated on five reasoning benchmarks and multiple model scales, it substantially improves the accuracy-cost trade-off, achieving steep log-linear scaling of Pass@k with respect to token budget.
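To make the three controls concrete, here is a minimal Python sketch of the sampling loop. `generate_trace` and `generate_answer` are hypothetical stand-ins for real LLM calls, and the truncate-by-fraction schedule is an illustrative assumption rather than the paper's exact method; note that a single depth fraction of 1.0 recovers full-CoT sampling, while a fraction near 0 recovers solution-only sampling.

```python
def generate_trace(prompt: str, max_steps: int, seed: int) -> list[str]:
    """Sample one reasoning trajectory as a list of intermediate steps."""
    raise NotImplementedError  # hypothetical stand-in for an actual LLM call


def generate_answer(prompt: str, partial_trace: list[str], seed: int) -> str:
    """Force a final answer conditioned on a (possibly truncated) trace."""
    raise NotImplementedError  # hypothetical stand-in for an actual LLM call


def fractured_sampling(
    prompt: str,
    n_trajectories: int,           # axis 1: parallel reasoning traces
    m_solutions: int,              # axis 2: final answers per truncation point
    depth_fractions: list[float],  # axis 3: how much of each trace to keep
    max_steps: int = 64,
) -> list[str]:
    """Enumerate candidate answers over the three orthogonal sampling axes."""
    candidates = []
    for t in range(n_trajectories):
        trace = generate_trace(prompt, max_steps, seed=t)
        for frac in depth_fractions:
            # Truncate the trace at this depth, keeping at least one step.
            truncated = trace[: max(1, int(frac * len(trace)))]
            for s in range(m_solutions):
                candidates.append(generate_answer(prompt, truncated, seed=s))
    return candidates
```

The total budget is roughly n_trajectories x len(depth_fractions) x m_solutions answer generations, which is the quantity the paper's accuracy-cost analysis allocates across the three axes.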
📝 Abstract
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. In particular, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning.
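The Pass@k metric used in these scaling curves is presumably the standard unbiased estimator of Chen et al. (2021): given n sampled candidates of which c are correct, it is the probability that a random size-k subset contains at least one correct answer. A minimal sketch of that computation, with the example counts chosen purely for illustration:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples drawn
    without replacement from n candidates (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct candidate
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative example: 24 candidates (say, 4 trajectories x 3 truncation
# depths x 2 solutions each), 5 of which are correct.
print(pass_at_k(n=24, c=5, k=8))  # ~0.897
```

Plotting this quantity against the total token budget, rather than against k alone, is what exposes the log-linear scaling behavior the abstract describes.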