🤖 AI Summary
Large language models (LLMs) suffer from a "Tunnel Vision" bottleneck during test-time compute scaling: serial inference locks the model into its imperfect early reasoning steps, so added computation yields only marginal gains. Method: ParaThinker is an end-to-end framework that trains an LLM to generate multiple independent, diverse reasoning paths in parallel and fuse their outputs through a learned aggregation step, shifting compute scaling from serial depth to parallel width. Contribution/Results: ParaThinker enables native parallel reasoning at inference time, substantially improving both accuracy and robustness. On challenging reasoning benchmarks, it boosts accuracy by 12.3% (1.5B model) and 7.5% (7B model) on average with 8 parallel paths, with only a 7.1% latency overhead; notably, the smaller model surpasses much larger baseline models.
📝 Abstract
Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling, a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thought simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient route to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.
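The width-over-depth idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy `reason` function below stands in for one sampled reasoning path (here a noisy solver that is right 70% of the time), and majority voting stands in for ParaThinker's learnable aggregation step, which actually synthesizes the parallel paths inside the model.

```python
import collections
import concurrent.futures
import random

def reason(problem, seed):
    # Toy stand-in for one sampled reasoning path: a noisy solver
    # that returns the correct answer with probability 0.7 and an
    # arbitrary digit otherwise. A real path would be a full
    # chain-of-thought rollout from the LLM.
    rng = random.Random(seed)
    if rng.random() < 0.7:
        return problem["answer"]
    return rng.randint(0, 9)

def solve_parallel(problem, width=8):
    # Launch `width` independent reasoning paths concurrently
    # (parallel width instead of serial depth).
    with concurrent.futures.ThreadPoolExecutor(max_workers=width) as ex:
        answers = list(ex.map(lambda s: reason(problem, s), range(width)))
    # Stand-in aggregation: majority vote. ParaThinker instead trains
    # an end-to-end fusion mechanism over the parallel paths.
    return collections.Counter(answers).most_common(1)[0][0]

print(solve_parallel({"answer": 4}, width=8))
```

Even with this crude voting aggregator, the parallel ensemble is more reliable than any single 70%-accurate path, which is the intuition behind scaling width rather than depth.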