AI Summary
This work investigates end-to-end training of large language models (LLMs) under a pure reinforcement learning (RL) paradigm, showing that RL on text-only data preserves, and in some cases improves, multimodal understanding, instruction following, and function-calling capabilities. To this end, we build a scalable, fully self-contained RL training pipeline entirely on our own infrastructure, without relying on external RL trajectories or knowledge-distillation data from prior models. We also present a simple method to force the model's reasoning language, giving language-level control over the chain of thought during RL. Starting from Mistral Medium 3 as the base model, we train Magistral Medium, a reasoning model obtained with RL alone, and open-source Magistral Small, together with cold-start data derived from Magistral Medium, under the Apache 2.0 license. Experiments demonstrate substantial gains over the base model on complex reasoning benchmarks.
Abstract
We introduce Magistral, Mistral's first reasoning model, and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which further includes cold-start data from Magistral Medium.
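The abstract mentions a simple method to force the model's reasoning language during RL. As a minimal sketch of one way such reward shaping could work, the snippet below grants the task reward only when the reasoning trace is detected to be in the requested language. This is an illustration, not the paper's exact recipe: `detect_language` is a toy stand-in for a real language-identification classifier, and both function names are hypothetical.

```python
def detect_language(text: str) -> str:
    """Toy language ID: hypothetical stand-in for a real classifier.
    Flags text as French if it contains common French function words."""
    french_markers = {"le", "la", "les", "donc", "est"}
    tokens = set(text.lower().split())
    return "fr" if tokens & french_markers else "en"


def shaped_reward(task_reward: float, reasoning_trace: str,
                  target_lang: str) -> float:
    """Language-consistency shaping: pass the task reward through only
    when the chain of thought matches the requested language,
    otherwise return zero so RL learns to reason in that language."""
    if detect_language(reasoning_trace) == target_lang:
        return task_reward
    return 0.0
```

Under this shaping, a policy optimized for expected reward is pushed to produce its reasoning in the target language, since correct answers reasoned in the wrong language earn nothing.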