🤖 AI Summary
Large language models (LLMs) suffer from a "Tunnel Vision" bottleneck during test-time compute scaling: serial inference locks the model into its imperfect early reasoning steps, so added computation yields only marginal gains. Method: ParaThinker is an end-to-end framework that trains an LLM to generate multiple independent, diverse reasoning paths in parallel and fuse their outputs through a learned aggregation step, shifting compute scaling from serial depth to parallel width. Contribution/Results: ParaThinker enables native parallel reasoning at inference time, substantially improving both accuracy and robustness. On challenging reasoning benchmarks, it boosts accuracy by 12.3% (1.5B model) and 7.5% (7B model) on average with 8 parallel paths, with only a 7.1% latency overhead; notably, the smaller model surpasses much larger baseline models.
📝 Abstract
Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling, a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thought simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient route to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.
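The width-over-depth idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy `reason` function below stands in for one sampled reasoning path (here a noisy solver that is right 70% of the time), and majority voting stands in for ParaThinker's learnable aggregation step, which actually synthesizes the parallel paths inside the model.

```python
import collections
import concurrent.futures
import random

def reason(problem, seed):
    # Toy stand-in for one sampled reasoning path: a noisy solver
    # that returns the correct answer with probability 0.7 and an
    # arbitrary digit otherwise. A real path would be a full
    # chain-of-thought rollout from the LLM.
    rng = random.Random(seed)
    if rng.random() < 0.7:
        return problem["answer"]
    return rng.randint(0, 9)

def solve_parallel(problem, width=8):
    # Launch `width` independent reasoning paths concurrently
    # (parallel width instead of serial depth).
    with concurrent.futures.ThreadPoolExecutor(max_workers=width) as ex:
        answers = list(ex.map(lambda s: reason(problem, s), range(width)))
    # Stand-in aggregation: majority vote. ParaThinker instead trains
    # an end-to-end fusion mechanism over the parallel paths.
    return collections.Counter(answers).most_common(1)[0][0]

print(solve_parallel({"answer": 4}, width=8))
```

Even with this crude voting aggregator, the parallel ensemble is more reliable than any single 70%-accurate path, which is the intuition behind scaling width rather than depth.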