Motif 2 12.7B technical report

📅 2025-11-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Problem: Training large language models (LLMs) under tight computational budgets suffers from low representational efficiency and training instability. Method: We propose an efficient LLM training framework featuring Grouped Differential Attention (which decouples signal and noise-control pathways), the MuonClip optimizer, the PolyNorm activation, and the Parallel Muon parallelization algorithm, integrated with a curriculum-driven data scheduler and a three-stage supervised fine-tuning pipeline. Contribution/Results: Trained on 5.5 trillion tokens, our model achieves significant improvements in instruction generalization and linguistic understanding, matching or exceeding substantially larger models on major benchmarks (including MMLU, GSM8K, and HumanEval) while reducing computational overhead. This demonstrates that co-optimizing the architecture, the optimization algorithm, and the training methodology is critical for both resource efficiency and capability scaling in LLM development.
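To make the attention change concrete, here is a minimal sketch of the differential-attention computation that Grouped Differential Attention builds on: a signal pathway minus a λ-scaled noise-control pathway. The fixed λ, the omitted head grouping, and the absence of per-head normalization are simplifying assumptions, not the report's implementation.

```python
import torch
import torch.nn.functional as F

def differential_attention(q_sig, k_sig, q_noise, k_noise, v, lam=0.5):
    """Signal-pathway attention minus a lambda-scaled noise-control pathway.

    Queries/keys have shape (batch, heads, seq, d), values (batch, heads, seq, d_v).
    In differential attention lam is learnable; it is a fixed placeholder here.
    """
    d = q_sig.size(-1)
    attn_sig = F.softmax(q_sig @ k_sig.transpose(-2, -1) / d ** 0.5, dim=-1)
    attn_noise = F.softmax(q_noise @ k_noise.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Subtracting the noise map suppresses attention mass spent on irrelevant context.
    return (attn_sig - lam * attn_noise) @ v

# Toy usage with random tensors.
b, h, s, d = 1, 4, 16, 32
q1, k1, q2, k2, v = (torch.randn(b, h, s, d) for _ in range(5))
out = differential_attention(q1, k1, q2, k2, v)  # (1, 4, 16, 32)
```

GDA additionally allocates heads unevenly between the signal and noise-control groups; that grouping is described in the paper and is not reproduced in this sketch.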

📝 Abstract
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
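As a rough illustration of the curriculum-driven data scheduler mentioned above, the sketch below linearly interpolates per-domain sampling weights over training. The domain names, ratios, and two-point linear schedule are hypothetical placeholders, not the report's actual composition schedule.

```python
def mixture_weights(step, total_steps, start, end):
    """Linearly interpolate per-domain sampling weights from `start` to `end`
    and renormalize so they sum to 1."""
    t = min(max(step / total_steps, 0.0), 1.0)
    raw = {d: (1 - t) * start[d] + t * end[d] for d in start}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

# Hypothetical example: shift mass from general web text toward math and code.
start = {"web": 0.70, "math": 0.10, "code": 0.10, "science": 0.10}
end   = {"web": 0.40, "math": 0.25, "code": 0.25, "science": 0.10}
print(mixture_weights(step=50_000, total_steps=100_000, start=start, end=end))
```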
Problem

Research questions and friction points this paper is trying to address.

Improving efficiency of large language models under constrained compute budgets
Enhancing representational efficiency through disentangled attention pathways
Achieving competitive performance with smaller model size through optimized design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grouped Differential Attention disentangles signal and noise pathways
Curriculum-driven data scheduler gradually shifts the data composition ratio over pre-training
MuonClip optimizer with fused kernels boosts training efficiency (see the sketch below)
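For orientation, the sketch below pairs the two ingredients the name MuonClip suggests: a Muon-style orthogonalized update (Newton-Schulz iteration, with coefficients taken from the public Muon reference implementation) and an illustrative QK-clip step that rescales the query/key projections when attention logits grow too large. Both the coefficients and the clipping rule are assumptions for illustration, not details taken from this report.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D update matrix, the core step of Muon.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = g.size(0) > g.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def qk_clip(w_q, w_k, max_logit, tau=100.0):
    """Illustrative QK-clip: if the largest attention logit observed this step
    exceeds tau, shrink the query/key projections to keep future logits bounded."""
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5
        w_q.mul_(scale)
        w_k.mul_(scale)

# Toy usage: orthogonalize a gradient-shaped matrix, then clip oversized logits.
update = newton_schulz_orthogonalize(torch.randn(512, 256))
w_q, w_k = torch.randn(256, 256), torch.randn(256, 256)
qk_clip(w_q, w_k, max_logit=180.0)
```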
Authors

Junghwan Lim (Motif Technologies)
Sungmin Lee (AIX, SK Telecom)
Dongseok Kim (Motif Technologies)
Taehyun Kim (Motif Technologies)
Eunhwan Park (Motif Technologies)
Jeesoo Lee (Motif Technologies)
Jeongdoo Lee (Motif Technologies)
Junhyeok Lee (Johns Hopkins University, Center for Language and Signal Processing)
Wai Ting Cheung (Motif Technologies)
Dahye Choi (Motif Technologies)
Jaeheui Her (Motif Technologies)
Jaeyeon Huh (Motif Technologies)
Hanbin Jung (Motif Technologies)
Changjin Kang (Motif Technologies)
Beomgyu Kim (Motif Technologies)
Minjae Kim (Motif Technologies)
Taewhan Kim (Seoul National University, Department of Electrical and Computer Engineering)
Youngrok Kim (Motif Technologies)
Hyukjin Kweon (Motif Technologies)
Haesol Lee (Motif Technologies)
Kun-Hui Lee (Motif Technologies)
Dongpin Oh (Motif Technologies)
Yeongjae Park (Motif Technologies)
Bokki Ryu (Motif Technologies)
Dongjoo Weon (Motif Technologies)