🤖 AI Summary
Training large language models (LLMs) under computational resource constraints often suffers from low representational efficiency and training instability. Method: We propose an efficient LLM training framework featuring Grouped Differential Attention (decoupling signal and noise-control pathways), the MuonClip optimizer, PolyNorm activations, and the Parallel Muon distributed optimization algorithm, integrated with a curriculum-driven data scheduler and a three-stage supervised fine-tuning pipeline. Contribution/Results: Trained on 5.5 trillion tokens, our model achieves significant improvements in instruction generalization and linguistic understanding, matching or exceeding the performance of substantially larger models on major benchmarks (including MMLU, GSM8K, and HumanEval) while reducing computational overhead. This demonstrates that co-optimizing the architecture, the optimization algorithm, and the training methodology is critical for enhancing both resource efficiency and capability scalability in LLM development.
📝 Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
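To make the "signal and noise-control pathways" intuition behind Grouped Differential Attention concrete, here is a minimal sketch of the underlying differential-attention computation: one attention map acts as the signal pathway, a second lambda-scaled map is subtracted to cancel attention noise. This is an illustrative NumPy toy, not the paper's implementation; the function name, weight layout, and fixed `lam` are assumptions, and GDA's actual grouped (asymmetric) head allocation between the two pathways is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Toy differential attention for one head (illustrative only).

    A 'signal' attention map minus a lambda-scaled 'noise-control'
    map is applied to the values, so common-mode attention noise
    cancels. GDA additionally groups heads asymmetrically between
    the two pathways, which this sketch omits.
    """
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))  # signal pathway
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))  # noise-control pathway
    return (A1 - lam * A2) @ (X @ Wv)

# Toy usage: 4 tokens, model dim 8, head dim 8, random weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) * 0.1 for _ in range(5)]
out = differential_attention(X, *W)
print(out.shape)  # (4, 8)
```

With `lam=0` this reduces to ordinary single-map scaled dot-product attention, which is one way to see the construction as a strict generalization.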