Megrez2 Technical Report

📅 2025-07-23
🤖 AI Summary
Addressing the challenge of balancing small model size with high performance for on-device deployment, this paper introduces Megrez2, a sparse mixture-of-experts (MoE) language model architecture designed for edge devices. Its core innovations are cross-layer expert sharing and pre-gated routing: the former reuses expert modules across adjacent Transformer layers to reduce parameter redundancy, while the latter predicts expert activation patterns before routing to enable memory-efficient expert loading and faster inference. With only 3B activated parameters and 7.5B total stored parameters, Megrez2 matches or surpasses larger models across language understanding, instruction following, mathematical reasoning, and code generation. Combined with an optimized on-device inference engine and a training pipeline of supervised fine-tuning followed by reinforcement learning with verifiable rewards, Megrez2 delivers robust real-world deployment performance.

📝 Abstract
We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device-native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent Transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture in balancing accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
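The parameter saving from cross-layer expert sharing can be illustrated with a toy parameter count. The sketch below is a minimal illustration under assumed names and sizes (the class names, layer counts, and dimensions are illustrative, not the paper's actual configuration): a group of adjacent layers keeps per-layer routers but dispatches tokens to one shared pool of expert weights, so expert parameters are stored once per group instead of once per layer.

```python
# Illustrative sketch of cross-layer expert sharing in an MoE model.
# All names and sizes are assumptions for demonstration, not Megrez2's
# actual architecture or hyperparameters.

class Expert:
    """A toy feed-forward expert: two weight matrices (W_in, W_out)."""
    def __init__(self, d_model, d_ff):
        self.n_params = 2 * d_model * d_ff


class SharedExpertGroup:
    """A pool of experts reused by several adjacent Transformer layers.

    Each layer keeps its own lightweight router, but every layer in the
    group dispatches tokens to the same expert weights, so the expert
    parameters are stored once for the whole group.
    """
    def __init__(self, n_layers, n_experts, d_model, d_ff):
        self.experts = [Expert(d_model, d_ff) for _ in range(n_experts)]
        # One small router (d_model x n_experts scores) per layer.
        self.router_params = [d_model * n_experts for _ in range(n_layers)]

    def stored_params(self):
        return sum(e.n_params for e in self.experts) + sum(self.router_params)


def unshared_params(n_layers, n_experts, d_model, d_ff):
    """Baseline MoE: every layer owns a private copy of all experts."""
    per_layer = n_experts * 2 * d_model * d_ff + d_model * n_experts
    return n_layers * per_layer


# Toy configuration: 3 adjacent layers share a pool of 8 experts.
group = SharedExpertGroup(n_layers=3, n_experts=8, d_model=64, d_ff=256)
baseline = unshared_params(n_layers=3, n_experts=8, d_model=64, d_ff=256)
print("shared:", group.stored_params(), "unshared:", baseline)
```

In this toy setup the shared group stores roughly a third of the baseline's expert parameters, which mirrors the paper's goal of shrinking stored parameters (7.5B) well below what the activated capacity would otherwise require.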
Problem

Research questions and friction points this paper is trying to address.

Balancing model size and performance for on-device deployment
Reducing parameter redundancy across Transformer layers in MoE models
Cutting expert-loading latency during inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer expert sharing reduces parameters
Pre-gated routing enables efficient inference
Lightweight architecture maintains high performance
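Pre-gated routing can be sketched as running the gate for layer i+1 on the hidden state of layer i, so the selected experts can be prefetched into fast memory while the current layer is still computing. The sketch below is a hedged illustration (the function names, dimensions, and weights are assumptions, not the paper's implementation):

```python
# Illustrative sketch of pre-gated routing: predict the next layer's
# expert choices early so their weights can be prefetched. Names and
# values are assumptions for demonstration only.
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def pre_gate(hidden, router_weights, k=2):
    """Predict which experts the *next* layer will activate.

    hidden:         current layer's hidden vector (list of floats)
    router_weights: one score row per expert of the next layer
    Returns the indices of the top-k experts to prefetch.
    """
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    return top_k(softmax(scores), k)


# Toy example: 4 candidate experts, 3-dim hidden state.
hidden = [0.5, -1.0, 2.0]
router = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.2, 0.2, 0.2]]
prefetch = pre_gate(hidden, router, k=2)
print(prefetch)  # expert indices to load before the next layer runs
```

Because the gating decision is available one layer early, expert weight loading overlaps with computation instead of stalling it, which is the inference speedup the pre-gated routing bullet refers to.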
Authors
Boxun Li, Infinigence-AI
Yadong Li, Infinigence-AI
Zhiyuan Li, Infinigence-AI
Congyi Liu, Infinigence-AI
Weilin Liu, University of Ottawa
Guowei Niu, Infinigence-AI
Zheyue Tan, Aalto University
Haiyang Xu, Infinigence-AI
Zhuyu Yao, Infinigence-AI
Tao Yuan, University of California, Los Angeles
Dong Zhou, Infinigence-AI
Yueqing Zhuang, Infinigence-AI
Bo Zhao, Aalto University
Guohao Dai, Shanghai Jiao Tong University
Yu Wang, Tsinghua University