🤖 AI Summary
Open-source large language models (LLMs) underperform closed-source counterparts on complex reasoning and long-context tasks (e.g., 64K-token contexts), and commonly suffer from training instability and model collapse during reasoning adaptation. Method: We propose a training recipe centered on a 12.7B-parameter open-source LLM, featuring (i) a two-stage curriculum-style supervised fine-tuning (SFT), (ii) difficulty-aware reinforcement learning fine-tuning (RLFT), (iii) verified, validation-aligned synthetic data generation, (iv) mixed-policy trajectory reuse, (v) stability-aware data filtering, and (vi) hybrid parallelism with kernel-level optimizations for efficient long-context training. Contribution/Results: Despite its moderate scale, our model matches significantly larger closed-source models across mathematical reasoning, code generation, and agentic benchmarks, while ensuring training reproducibility and stability. This work delivers a high-performance, scalable, and openly accessible framework for advancing complex reasoning capabilities in the open-source community.
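The mixed-policy trajectory reuse mentioned in (iv) can be sketched roughly as follows: each RL step combines freshly sampled on-policy trajectories with a bounded buffer of trajectories from recent policy versions, amortizing the cost of expensive long-context rollouts. The class name, buffer size, and reuse ratio below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of mixed-policy trajectory reuse for RLFT.
# Assumption: trajectories from slightly stale policies can still be
# used (e.g., with importance-weight corrections, omitted here).
from collections import deque


class TrajectoryBuffer:
    def __init__(self, max_reuse=128):
        # Oldest trajectories are evicted first once the cap is hit.
        self.buffer = deque(maxlen=max_reuse)

    def add(self, trajectories):
        self.buffer.extend(trajectories)

    def mixed_batch(self, fresh, reuse_ratio=0.25):
        """Combine fresh on-policy trajectories with up to
        reuse_ratio * len(fresh) reused ones from earlier policies."""
        n_reuse = min(len(self.buffer), int(len(fresh) * reuse_ratio))
        reused = [self.buffer[i] for i in range(n_reuse)]
        self.add(fresh)  # fresh trajectories become reusable next step
        return fresh + reused


buf = TrajectoryBuffer()
step1 = buf.mixed_batch(["t1", "t2", "t3", "t4"])  # buffer empty: no reuse
step2 = buf.mixed_batch(["t5", "t6", "t7", "t8"])  # reuses 1 earlier trajectory
print(step1)  # -> ['t1', 't2', 't3', 't4']
print(step2)  # -> ['t5', 't6', 't7', 't8', 't1']
```

In practice such a buffer would also track the policy version of each trajectory so stale samples can be discarded or reweighted; this sketch only shows the batching mechanics.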
📝 Abstract
We introduce Motif-2-12.7B-Reasoning, a 12.7B-parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach pairs memory-efficient infrastructure for 64K-token contexts, built on hybrid parallelism and kernel-level optimizations, with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
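The difficulty-aware data filtering described in the abstract can be sketched in minimal form as follows: prompts that the current policy always solves or never solves contribute no learning signal in group-relative RL objectives, so only prompts with an intermediate empirical pass rate are kept. The function name, the 0/1 success encoding, and the strict (0, 1) window are illustrative assumptions, not the paper's actual criterion.

```python
# Hypothetical sketch of difficulty-aware filtering for RLFT.
# Input: each prompt has been rolled out several times with the current
# policy, and each rollout is scored as 1 (verified correct) or 0.

def difficulty_filter(rollouts, low=0.0, high=1.0):
    """rollouts: dict mapping prompt_id -> list of 0/1 success flags.
    Returns prompt_ids whose pass rate lies strictly inside (low, high),
    i.e., prompts that are neither trivial nor currently unsolvable."""
    kept = []
    for prompt_id, successes in rollouts.items():
        pass_rate = sum(successes) / len(successes)
        if low < pass_rate < high:
            kept.append(prompt_id)
    return kept


# Example: "a" is trivially solved, "c" is never solved; only "b" remains.
rollouts = {
    "a": [1, 1, 1, 1],
    "b": [1, 0, 1, 0],
    "c": [0, 0, 0, 0],
}
print(difficulty_filter(rollouts))  # -> ['b']
```

Tightening `low` and `high` (e.g., keeping pass rates in [0.2, 0.8]) is a common way to concentrate training on mid-difficulty prompts; the paper's exact thresholds and pass-rate estimator are not specified here.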