AI Summary
This work investigates end-to-end training of large language models (LLMs) under a pure reinforcement learning (RL) paradigm, showing that RL on text-only data preserves, and in some cases improves, multimodal understanding, instruction following, and function-calling capabilities. To this end, we build a scalable, fully self-contained RL training pipeline entirely on our own infrastructure, without relying on external RL trajectories or knowledge-distillation data from prior models. We also present a simple method to force the model's reasoning language, giving language-level control over the chain of thought during RL. Starting from Mistral Medium 3 as the base model, we train Magistral Medium, a reasoning model obtained with RL alone, and open-source Magistral Small, together with cold-start data derived from Magistral Medium, under the Apache 2.0 license. Experiments demonstrate substantial gains over the base model on complex reasoning benchmarks.
Abstract
We introduce Magistral, Mistral's first reasoning model, and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which further includes cold-start data from Magistral Medium.
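The abstract mentions a simple method to force the model's reasoning language during RL. As a minimal sketch of one way such reward shaping could work, the snippet below grants the task reward only when the reasoning trace is detected to be in the requested language. This is an illustration, not the paper's exact recipe: `detect_language` is a toy stand-in for a real language-identification classifier, and both function names are hypothetical.

```python
def detect_language(text: str) -> str:
    """Toy language ID: hypothetical stand-in for a real classifier.
    Flags text as French if it contains common French function words."""
    french_markers = {"le", "la", "les", "donc", "est"}
    tokens = set(text.lower().split())
    return "fr" if tokens & french_markers else "en"


def shaped_reward(task_reward: float, reasoning_trace: str,
                  target_lang: str) -> float:
    """Language-consistency shaping: pass the task reward through only
    when the chain of thought matches the requested language,
    otherwise return zero so RL learns to reason in that language."""
    if detect_language(reasoning_trace) == target_lang:
        return task_reward
    return 0.0
```

Under this shaping, a policy optimized for expected reward is pushed to produce its reasoning in the target language, since correct answers reasoned in the wrong language earn nothing.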