🤖 AI Summary
This paper introduces the first unified foundation model bridging embodied AI and autonomous driving, addressing the long-standing knowledge-transfer bottleneck caused by the two fields' disciplinary isolation. Methodologically, the authors propose a multi-stage collaborative learning framework, construct a high-quality cross-domain dataset, and jointly fine-tune the model via chain-of-thought (CoT) reasoning and reinforcement learning (RL) to enable bidirectional positive transfer. The core contribution is the first semantic- and decision-level alignment mechanism between the two domains, yielding significant complementary gains in task planning, affordance prediction, spatial understanding, state estimation, and driving-policy generation. Extensive experiments across 17 embodied-AI and 12 autonomous-driving benchmarks show that the model consistently outperforms state-of-the-art open-source, closed-source, and domain-specific models, validating both cross-domain generalization and mutual performance enhancement.
📝 Abstract
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction, and Spatial Understanding, while also excelling across 12 autonomous driving benchmarks covering Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.