Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead and inference latency of reasoning-optimized LLMs, this paper proposes Pangu Embedded, a 7B large language model optimized for Ascend NPU deployment. Methodologically, the authors introduce: (1) a dual-system inference architecture with complexity-aware automatic mode switching and a manual override; (2) a Multi-source Adaptive Reward System (MARS) coupled with an iterative distillation and inter-iteration model-merging training paradigm; and (3) NPU-specific training optimizations, including a latency-tolerant scheduler that combines stale-synchronous parallelism with prioritized data queues, and reward signals that blend deterministic metrics with lightweight LLM evaluators. Evaluated on AIME 2024, GPQA, and LiveCodeBench, the model outperforms similar-scale baselines (e.g., Qwen3-8B, GLM4-9B) in reasoning quality while maintaining low inference latency, achieving state-of-the-art performance under real-world NPU constraints.

📝 Abstract
This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a "fast" mode for routine queries and a deeper "slow" mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded, with 7B parameters, outperforms similar-size models like Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in reasoning-optimized LLMs
Minimizing inference latency for efficient LLM performance
Balancing fast and slow thinking modes for complex tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training framework for efficient LLM
Dual-system fast-slow mode for dynamic reasoning
Multi-source Adaptive Reward System guiding RL
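The multi-source reward idea in the last bullet can be sketched as a dispatcher: deterministic checks for verifiable domains (math, code), a lightweight evaluator for open-ended answers. Everything below is an assumed illustration in the spirit of MARS; the task names, signatures, and scoring rules are not from the paper.

```python
# Hypothetical sketch of multi-source reward routing (MARS-style).
# All function names and scoring rules are illustrative assumptions.

def math_reward(answer: str, reference: str) -> float:
    # Deterministic metric: exact match on the final answer string.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def code_reward(passed_tests: int, total_tests: int) -> float:
    # Deterministic metric: fraction of unit tests passed.
    return passed_tests / total_tests if total_tests else 0.0

def general_reward(answer: str) -> float:
    # Placeholder for a lightweight LLM evaluator scoring
    # open-ended answers; returns a fixed score here.
    return 0.5

def mars_reward(task: str, **kwargs) -> float:
    """Route a rollout to the reward source matching its task type."""
    routers = {"math": math_reward, "code": code_reward, "general": general_reward}
    return routers[task](**kwargs)

print(mars_reward("math", answer="42", reference="42"))
print(mars_reward("code", passed_tests=3, total_tests=4))
```

The design point is that each task family gets the cheapest reliable signal available, so the RL loop pays for an LLM judge only where no deterministic metric exists.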