Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a fundamental limitation of large language models (LLMs): their reliance on strictly serial reasoning, which precludes native parallel cognition. The authors propose the first teacher-free, self-evolving parallel reasoning framework. Methodologically, it combines self-distilled reinforcement learning, a Parallel-Aware Policy Optimization (PAPO) algorithm, and the Native Parallel Reasoner (NPR) Engine, built via graph-structured policy modeling and a deep refactoring of the SGLang execution engine to coordinate memory management and workflow parallelization. A key contribution is the first demonstration of 100% genuine parallel inference: all reasoning branches activate simultaneously, with no implicit sequential simulation. Experiments on Qwen3-4B show performance gains of up to 24.5% across eight reasoning benchmarks and inference speedups of up to 4.6×, significantly improving both the efficiency and the scalability of agentic reasoning.

๐Ÿ“ Abstract
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6×. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
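The abstract's central claim is that all reasoning branches execute concurrently rather than being simulated one after another in a single autoregressive stream. A minimal sketch of that fan-out/aggregate pattern is below; `solve_branch`, `parallel_reason`, and the thread-pool scheduling are illustrative assumptions, not the NPR Engine's actual implementation (which refactors SGLang internals).

```python
# Illustrative sketch of fan-out parallel reasoning, NOT the NPR Engine.
# In practice each branch would be an LLM decoding call; here a stub stands in.
from concurrent.futures import ThreadPoolExecutor


def solve_branch(branch_prompt: str) -> str:
    # Hypothetical stand-in for one reasoning branch (an LLM call in practice).
    return f"answer({branch_prompt})"


def parallel_reason(question: str, branches: list[str]) -> str:
    # Fan out: all branches are launched simultaneously rather than
    # simulated sequentially, then an aggregation step merges the results.
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        results = list(pool.map(solve_branch, branches))  # order preserved
    return f"{question}: " + " | ".join(results)


print(parallel_reason("q", ["sub1", "sub2", "sub3"]))
```

The design point this illustrates is the contrast with "sequential emulation" baselines, which emit branch tokens one after another inside a single decode loop instead of dispatching them concurrently.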
Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to self-evolve genuine parallel reasoning capabilities
Transforming models from sequential emulation to native parallel cognition
Achieving performance gains and inference speedups via parallel execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled progressive training paradigm for parallel cognition
Parallel-Aware Policy Optimization algorithm for adaptive decomposition
Robust NPR Engine refactors memory and flow control
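The innovation list above credits PAPO with learning adaptive decomposition "via trial and error". This page gives no algorithmic details, so the following is a generic REINFORCE-style policy-gradient update on a single branching decision, offered only to make the trial-and-error idea concrete; it is not the PAPO algorithm, and all names and hyperparameters are assumptions.

```python
# Generic REINFORCE-style update on a branching decision.
# This is an illustration of trial-and-error policy learning, NOT PAPO.
import math


def softmax(logits: list[float]) -> list[float]:
    # Numerically stable softmax over the branching-choice logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.1):
    # grad of log pi(action) w.r.t. logits is (one_hot(action) - probs);
    # scale by the advantage (reward - baseline) and step uphill.
    probs = softmax(logits)
    adv = reward - baseline
    return [l + lr * adv * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


# Two hypothetical decomposition choices, e.g. a 2-way vs 4-way split.
logits = [0.0, 0.0]
logits = reinforce_step(logits, action=1, reward=1.0, baseline=0.0)
assert softmax(logits)[1] > softmax(logits)[0]  # rewarded choice is now likelier
```

Under the paper's framing, this kind of update would be applied to branching policies inside the execution graph rather than to a flat action space, but that graph structure is beyond what this summary specifies.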
Tong Wu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Yang Liu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Jun Bai
Assistant professor
Computer-aided drug discovery · Medical image analysis · AI therapeutic target identification
Zixia Jia
BigAI
NLP
Shuyi Zhang
East China Normal University
Big data analysis · Semi-supervised learning · High-dimensional statistics · Applied data science
Ziyong Lin
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Yanting Wang
Penn State University
Trustworthy AI
Song-Chun Zhu
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Zilong Zheng
NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)