🤖 AI Summary
To balance compactness and performance for on-device deployment, this paper introduces Megrez2, a sparse mixture-of-experts (MoE) language model architecture designed for edge devices. Its core innovations are cross-layer expert sharing and pre-gated routing: the former reuses expert modules across adjacent Transformer layers to reduce parameter redundancy, while the latter predicts expert activation patterns before routing to enable memory-efficient expert loading and faster inference. With only 3B activated parameters and 7.5B total stored parameters, Megrez2 matches or surpasses larger models on language understanding, instruction following, mathematical reasoning, and code generation. Combined with an optimized on-device inference engine and a training pipeline of supervised fine-tuning followed by reinforcement learning with verifiable rewards, Megrez2 delivers robust real-world deployment performance.
📝 Abstract
We present Megrez2, a novel lightweight, high-performance language model architecture optimized for device-native deployment. Megrez2 introduces a cross-layer expert sharing mechanism that significantly reduces the total parameter count by reusing expert modules across adjacent Transformer layers while retaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce Megrez2-Preview, a model pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview performs competitively with, or better than, larger models across a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture in balancing accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
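The two mechanisms above can be illustrated with a toy sketch. Everything below is an illustrative assumption rather than the paper's actual implementation: the dimensions, the plain ReLU FFN experts, and the specific pre-gating rule (predicting every layer's routing from the group's input so that the required experts can be prefetched once) are all hypothetical choices made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K, GROUP = 16, 4, 2, 3  # hidden dim, experts, active experts, layers per group

# Cross-layer expert sharing: one pool of expert FFNs serves a whole group
# of adjacent layers, so expert parameters are stored once, not per layer.
experts = [(rng.standard_normal((D, 4 * D)) * 0.02,
            rng.standard_normal((4 * D, D)) * 0.02) for _ in range(N_EXPERTS)]
gates = [rng.standard_normal((D, N_EXPERTS)) * 0.02 for _ in range(GROUP)]

def expert_ffn(x, w_in, w_out):
    # A minimal ReLU feed-forward expert (illustrative, not the paper's design).
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_group(x):
    """Run a group of layers over the shared expert pool with pre-gating:
    each layer's router is evaluated on the input to the group, so the set
    of experts to load is known before any layer in the group executes."""
    # Pre-gating: compute all routing decisions up front from the group input.
    routes = []
    for g in gates:
        logits = x @ g
        top = np.argsort(logits)[-TOP_K:]          # indices of top-k experts
        w = np.exp(logits[top] - logits[top].max())
        routes.append((top, w / w.sum()))           # softmax over selected experts
    # The experts needed by the whole group can now be prefetched in one pass.
    needed = sorted({int(e) for top, _ in routes for e in top})
    for top, w in routes:  # execute the layers with the precomputed routes
        x = x + sum(wi * expert_ffn(x, *experts[e]) for e, wi in zip(top, w))
    return x, needed

out, prefetch = moe_group(rng.standard_normal(D))
print(out.shape, prefetch)
```

With `GROUP = 3` layers and `TOP_K = 2`, at most six distinct experts are touched per group, and the `needed` list makes that set known before execution, which is what allows memory-efficient, ahead-of-time expert loading on a device.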