🤖 AI Summary
To improve general-purpose visual understanding and multimodal reasoning, this work introduces MiMo-VL, a pair of open-source 7B vision-language models. The authors propose a four-stage pre-training paradigm over 2.4 trillion tokens and empirically demonstrate the value of incorporating high-quality, long chain-of-thought data into pre-training. Methodologically, they pair this with Mixed On-policy Reinforcement Learning (MORL), which fuses reward signals from diverse sources, and they release a unified evaluation suite spanning 50+ tasks. Experiments show that MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of 40 benchmarks, achieves 59.4 on OlympiadBench (surpassing models with up to 78B parameters), and attains 56.1 on OSWorld-G, setting a new state of the art in GUI grounding.
📝 Abstract
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
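The abstract says MORL integrates diverse reward signals but does not specify the fusion mechanism. As a purely illustrative sketch, one common way to combine per-domain verifiable rewards into a single scalar for an on-policy RL update is a weighted sum; every function name and weight below is a hypothetical assumption, not the paper's actual method:

```python
# Illustrative sketch only: weighted-sum fusion of per-domain reward
# signals for on-policy RL. All names and weights are hypothetical,
# not taken from the MiMo-VL paper.

def accuracy_reward(sample: dict) -> float:
    # Hypothetical verifier: 1.0 if the model's answer matches ground truth.
    return 1.0 if sample["answer"] == sample["ground_truth"] else 0.0

def format_reward(sample: dict) -> float:
    # Hypothetical check that the response contains a chain-of-thought block.
    return 1.0 if "<think>" in sample["response"] else 0.0

# Each domain contributes (reward_fn, weight); weights are assumptions.
REWARD_FNS = {
    "accuracy": (accuracy_reward, 0.8),
    "format": (format_reward, 0.2),
}

def mixed_reward(sample: dict) -> float:
    # Fuse the per-domain signals into one scalar for the policy update.
    return sum(w * fn(sample) for fn, w in REWARD_FNS.values())
```

In practice the interesting challenge the abstract hints at ("simultaneous multi-domain optimization") is that such signals can conflict across domains, which simple static weighting does not resolve.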