Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

📅 2025-06-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generalist agents for open-world environments like Minecraft face three key challenges: scarcity of domain-specific annotated data, interference across heterogeneous tasks, and high diversity in visual observations. This paper introduces Optimus-3, a general-purpose multimodal agent for Minecraft, and addresses these challenges through three contributions: (1) a knowledge-enhanced synthetic data generation pipeline to alleviate data scarcity; (2) a Mixture-of-Experts (MoE) architecture with task-level routing, which decouples heterogeneous tasks while still allowing them to collaborate; and (3) a reinforcement learning approach augmented with multimodal reasoning, intended to strengthen the perception–planning–execution–reflection loop under visual diversity. Evaluated on a wide range of Minecraft tasks, Optimus-3 outperforms both generalist multimodal large language models and existing state-of-the-art agents.

📝 Abstract

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging for three reasons: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/

Problem

Research questions and friction points this paper is trying to address.

Insufficient domain-specific data for Minecraft agents

Interference among heterogeneous tasks in open-world environments

Visual diversity challenges in Minecraft agent perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-enhanced data generation pipeline

Mixture-of-Experts with task-level routing

Multimodal Reasoning-Augmented Reinforcement Learning
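
The task-level routing idea listed above can be illustrated with a small sketch. This is not the Optimus-3 implementation: the class name `TaskRoutedMoE`, the task labels, the presence of a shared expert, and the use of plain linear maps as experts are all assumptions made for illustration. The point it shows is that routing is decided once per task rather than per token, so each task updates and uses its own expert parameters.

```python
import random

# Hypothetical task labels, loosely based on the capabilities named in the abstract.
TASKS = ["perception", "planning", "action", "grounding", "reflection"]


def make_linear(dim, rng):
    """Create a dim x dim weight matrix with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_linear(weights, x):
    """Multiply a weight matrix by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class TaskRoutedMoE:
    """Toy MoE layer: one expert per task plus an assumed shared expert."""

    def __init__(self, dim, tasks, seed=0):
        rng = random.Random(seed)
        # Shared expert (an assumption): applied to every input regardless of task.
        self.shared = make_linear(dim, rng)
        # One dedicated expert per task, so tasks do not overwrite each other.
        self.experts = {task: make_linear(dim, rng) for task in tasks}

    def forward(self, x, task):
        # Task-level routing: the task label alone selects the expert.
        if task not in self.experts:
            raise KeyError(f"no expert registered for task {task!r}")
        shared_out = apply_linear(self.shared, x)
        expert_out = apply_linear(self.experts[task], x)
        return [s + e for s, e in zip(shared_out, expert_out)]


layer = TaskRoutedMoE(dim=4, tasks=TASKS)
x = [1.0, 0.5, -0.5, 2.0]
out_planning = layer.forward(x, "planning")
out_action = layer.forward(x, "action")
```

The same input produces different outputs for `"planning"` and `"action"` because each call passes through a different expert, while the shared expert contributes identically to both. In a trained model the experts would be feed-forward blocks inside a transformer rather than single matrices.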

🔎 Similar Papers

Odyssey: Empowering Minecraft Agents with Open-World Skills