🤖 AI Summary
This work addresses the lack of efficient, multifunctional small language models suitable for compute- and memory-constrained environments by introducing the Ministral 3 series—parameter-efficient dense models at 3B, 8B, and 14B scales. Each variant is released in three versions: base pretrained, instruction-tuned, and reasoning-optimized, with support for multimodal image understanding. The core innovation lies in a cascaded distillation approach that integrates iterative pruning, continual knowledge distillation, and multitask continued pretraining, achieving substantial gains in inference efficiency without compromising performance. Evaluated across complex reasoning and general-purpose tasks, the entire model family demonstrates strong empirical results and is released under the Apache 2.0 license.
📝 Abstract
We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, a technique that combines iterative pruning with continued training under distillation. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
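The abstract describes Cascade Distillation only at a high level: alternate pruning of the model with continued training under a distillation loss from the larger parent. The exact recipe is not given here, so the following is only a minimal NumPy sketch of that alternation on a toy linear "model": magnitude pruning, then gradient descent on a temperature-softened KL loss against fixed teacher logits. All function names, the temperature, learning rate, and prune fraction are illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(z, T=1.0):
    # numerically stable softmax over temperature-scaled logits
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss_and_grad(W_s, x, teacher_logits, T=2.0):
    """Softened KL(teacher || student) and its gradient w.r.t. student weights."""
    p = softmax(teacher_logits, T)
    q = softmax(x @ W_s, T)
    loss = float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
    # dL/d(student logits) = T * (q - p) / batch for the T^2-scaled KL
    g_logits = T * (q - p) / x.shape[0]
    return loss, x.T @ g_logits

def magnitude_prune(W, frac):
    """Zero out the smallest-magnitude `frac` of the weights."""
    k = int(W.size * frac)
    if k == 0:
        return W.copy()
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

# Toy cascade: prune, then distill to recover the teacher's behaviour.
rng = np.random.default_rng(0)
W_t = rng.normal(size=(16, 8))          # "teacher" weights
x = rng.normal(size=(64, 16))           # probe inputs
teacher_logits = x @ W_t
W_s = W_t.copy()                        # student starts as a copy of the teacher
for _round in range(3):
    W_s = magnitude_prune(W_s, 0.2)
    mask = (W_s != 0).astype(W_s.dtype)
    for _ in range(300):
        loss, grad = distill_loss_and_grad(W_s, x, teacher_logits)
        W_s -= 0.5 * grad * mask        # masked update keeps pruned weights at zero
```

In a real LLM setting the pruning would act on structured units (layers, heads, hidden dimensions) and the distillation loss would be computed token-by-token against the parent model's output distribution, but the prune-then-recover loop has the same shape.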