🤖 AI Summary
Training large language models (LLMs) under computational resource constraints often suffers from low representational efficiency and training instability. Method: We propose an efficient LLM training framework featuring Grouped Differential Attention (decoupling signal and noise-control pathways), the MuonClip optimizer, PolyNorm activations, and the Parallel Muon distributed optimization algorithm, integrated with a curriculum-driven data scheduler and a three-stage supervised fine-tuning pipeline. Contribution/Results: Trained on 5.5 trillion tokens, our model achieves significant improvements in instruction generalization and linguistic understanding, matching or exceeding the performance of substantially larger models on major benchmarks (including MMLU, GSM8K, and HumanEval) while reducing computational overhead. This demonstrates that co-optimizing the architecture, the optimization algorithm, and the training methodology is critical for enhancing both resource efficiency and capability scalability in LLM development.
📝 Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
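To make the "signal and noise-control pathways" intuition behind Grouped Differential Attention concrete, here is a minimal sketch of the underlying differential-attention computation: one attention map acts as the signal pathway, a second lambda-scaled map is subtracted to cancel attention noise. This is an illustrative NumPy toy, not the paper's implementation; the function name, weight layout, and fixed `lam` are assumptions, and GDA's actual grouped (asymmetric) head allocation between the two pathways is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Toy differential attention for one head (illustrative only).

    A 'signal' attention map minus a lambda-scaled 'noise-control'
    map is applied to the values, so common-mode attention noise
    cancels. GDA additionally groups heads asymmetrically between
    the two pathways, which this sketch omits.
    """
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))  # signal pathway
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))  # noise-control pathway
    return (A1 - lam * A2) @ (X @ Wv)

# Toy usage: 4 tokens, model dim 8, head dim 8, random weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) * 0.1 for _ in range(5)]
out = differential_attention(X, *W)
print(out.shape)  # (4, 8)
```

With `lam=0` this reduces to ordinary single-map scaled dot-product attention, which is one way to see the construction as a strict generalization.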