SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

📅 2025-06-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from high inference latency on complex tasks because they rely on long, sequential chain-of-thought reasoning. Method: This paper introduces a "plan–parallel execution" alternating inference paradigm that dynamically identifies independent subtasks within the reasoning chain and schedules their generation in parallel. To enable this, the authors design a data curation pipeline that reorganizes natural language reasoning traces into structured rounds of long-horizon planning and parallel execution, then fine-tune the model on a small amount of this curated data. Contribution/Results: On mathematical reasoning benchmarks, the fine-tuned models match the accuracy of the original reasoning models while generating up to ~39% fewer sequential tokens on problems requiring more than 8,000 output tokens; on the out-of-distribution GPQA and Countdown tasks, average sequential tokens drop by up to 45% and 65%, respectively, for longer reasoning trajectories. These gains translate into substantially lower inference latency without compromising reasoning capability.
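The alternating plan–parallel-execution loop described above can be sketched in a few lines. This is a minimal illustration, not SPRINT's implementation: `plan` and `generate` are hypothetical stand-ins for calls to the fine-tuned LRM (the planner emitting a round of independent subtasks, and the executor expanding each one), and a thread pool stands in for the paper's parallel scheduling mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call that executes one subtask;
    # here it just echoes a trivial "result" for illustration.
    return f"result({prompt})"

def plan(context: str) -> list[str]:
    # Hypothetical planner: in SPRINT this is the fine-tuned LRM emitting
    # a round of independent subtasks. Here we fabricate two subtasks.
    return [f"{context}::subtask-{i}" for i in range(2)]

def sprint_round(context: str) -> str:
    # One plan -> parallel-execution round: the planner proposes
    # independent subtasks, which are generated concurrently, and their
    # results are merged back into the shared reasoning context. Serial
    # token cost is roughly the plan plus the longest subtask, not the
    # sum of all subtasks.
    subtasks = plan(context)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(generate, subtasks))
    return context + " | " + "; ".join(results)
```

Interleaving comes from calling `sprint_round` repeatedly, feeding each round's merged context into the next planning step until the model emits a final answer instead of new subtasks.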

📝 Abstract
Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates an innovative data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and effectively execute them in parallel. Through extensive evaluations, we show that the models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to ~39% fewer sequential tokens on problems requiring more than 8000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks of GPQA and Countdown with up to 45% and 65% reduction in average sequential tokens for longer reasoning trajectories, while achieving the performance of the fine-tuned reasoning model.
Problem

Research questions and friction points this paper is trying to address.

Reducing long sequential chains in reasoning models
Enabling parallel execution in reasoning processes
Maintaining performance while decreasing inference time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic parallelization in reasoning models
Structured planning and parallel execution
Reduced sequential tokens via fine-tuning