Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a fundamental limitation of large language models (LLMs): their reliance on strictly serial reasoning, which precludes native parallel cognition. The authors propose the first teacher-free, self-evolving parallel reasoning framework. Methodologically, it combines self-distilled reinforcement learning, a Parallel-Aware Policy Optimization (PAPO) algorithm, and the Native Parallel Reasoner (NPR) Engine, built via graph-structured policy modeling and a deep refactoring of the SGLang execution engine to coordinate memory management and workflow parallelization. A key contribution is the first demonstration of 100% genuine parallel inference: all reasoning branches activate simultaneously, with no implicit sequential simulation. Experiments on Qwen3-4B show performance gains of up to 24.5% across eight reasoning benchmarks and inference speedups of up to 4.6×, significantly improving both the efficiency and the scalability of agentic reasoning.

๐Ÿ“ Abstract
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6×. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
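The abstract's central claim is that all reasoning branches execute concurrently rather than being simulated one after another in a single autoregressive stream. A minimal sketch of that fan-out/aggregate pattern is below; `solve_branch`, `parallel_reason`, and the thread-pool scheduling are illustrative assumptions, not the NPR Engine's actual implementation (which refactors SGLang internals).

```python
# Illustrative sketch of fan-out parallel reasoning, NOT the NPR Engine.
# In practice each branch would be an LLM decoding call; here a stub stands in.
from concurrent.futures import ThreadPoolExecutor


def solve_branch(branch_prompt: str) -> str:
    # Hypothetical stand-in for one reasoning branch (an LLM call in practice).
    return f"answer({branch_prompt})"


def parallel_reason(question: str, branches: list[str]) -> str:
    # Fan out: all branches are launched simultaneously rather than
    # simulated sequentially, then an aggregation step merges the results.
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        results = list(pool.map(solve_branch, branches))  # order preserved
    return f"{question}: " + " | ".join(results)


print(parallel_reason("q", ["sub1", "sub2", "sub3"]))
```

The design point this illustrates is the contrast with "sequential emulation" baselines, which emit branch tokens one after another inside a single decode loop instead of dispatching them concurrently.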
Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to self-evolve genuine parallel reasoning capabilities
Transforming models from sequential emulation to native parallel cognition
Achieving performance gains and inference speedups via parallel execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled progressive training paradigm for parallel cognition
Parallel-Aware Policy Optimization algorithm for adaptive decomposition
Robust NPR Engine refactors memory and flow control
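The innovation list above credits PAPO with learning adaptive decomposition "via trial and error". This page gives no algorithmic details, so the following is a generic REINFORCE-style policy-gradient update on a single branching decision, offered only to make the trial-and-error idea concrete; it is not the PAPO algorithm, and all names and hyperparameters are assumptions.

```python
# Generic REINFORCE-style update on a branching decision.
# This is an illustration of trial-and-error policy learning, NOT PAPO.
import math


def softmax(logits: list[float]) -> list[float]:
    # Numerically stable softmax over the branching-choice logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.1):
    # grad of log pi(action) w.r.t. logits is (one_hot(action) - probs);
    # scale by the advantage (reward - baseline) and step uphill.
    probs = softmax(logits)
    adv = reward - baseline
    return [l + lr * adv * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


# Two hypothetical decomposition choices, e.g. a 2-way vs 4-way split.
logits = [0.0, 0.0]
logits = reinforce_step(logits, action=1, reward=1.0, baseline=0.0)
assert softmax(logits)[1] > softmax(logits)[0]  # rewarded choice is now likelier
```

Under the paper's framing, this kind of update would be applied to branching policies inside the execution graph rather than to a flat action space, but that graph structure is beyond what this summary specifies.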
Tong Wu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Yang Liu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Jun Bai
Assistant professor
Computer-aided drug discovery · Medical image analysis · AI therapeutic target identification
Zixia Jia
BigAI
NLP
Shuyi Zhang
East China Normal University
Big data analysis · Semi-supervised learning · High-dimensional statistics · Applied data science
Ziyong Lin
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Yanting Wang
Penn State University
Trustworthy AI
Song-Chun Zhu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Zilong Zheng
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)