Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

📅 2025-02-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Modeling cross-task behavior and disentangling the complex couplings among observations, actions, and language in the open-world environment of Minecraft remain challenging. To address these issues, this paper proposes Optimus-2, a multimodal embodied agent whose core is a Goal-Observation-Action Conditioned Policy (GOAP). Its key contributions are: (1) an action-guided behavior encoder that dynamically compresses long-horizon observation-action histories into fixed-length behavior tokens; (2) MGOA, a large-scale multimodal instruction-aligned dataset for Minecraft comprising 25K videos and roughly 30M goal-observation-action pairs; and (3) integration of a multimodal large language model (MLLM) for high-level planning, with behavior tokens aligned to open-ended language instructions to improve policy generalization. Optimus-2 achieves state-of-the-art performance across atomic tasks, long-horizon tasks, and open-ended instruction-following tasks, while demonstrating improved robustness and cross-task transferability.


πŸ“ Abstract
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
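The key structural idea in the abstract, compressing an ever-growing observation-action history into a fixed number of behavior tokens so the goal-conditioned policy always sees a fixed-size input, can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names (`BehaviorEncoder`, `predict_action`), the mean-pooling compression, and the scalar observations are all simplifying assumptions made for illustration.

```python
# Hypothetical sketch of a GOAP-style control loop (illustrative only,
# not the paper's code). Observations and actions are scalars here; the
# real encoder and MLLM policy operate on video frames and action spaces.

from dataclasses import dataclass


@dataclass
class BehaviorEncoder:
    """Compresses a variable-length observation-action history into a
    fixed number of behavior tokens (here crudely approximated by
    chunked mean-pooling)."""
    num_tokens: int = 4

    def encode(self, history):
        # history: list of (observation, action) pairs
        if not history:
            return [0.0] * self.num_tokens
        n = len(history)
        tokens = []
        for i in range(self.num_tokens):
            lo = i * n // self.num_tokens
            hi = max(lo + 1, (i + 1) * n // self.num_tokens)
            chunk = history[lo:hi]  # always non-empty
            tokens.append(sum(o + a for o, a in chunk) / len(chunk))
        return tokens  # fixed length, regardless of history length


def predict_action(goal, observation, behavior_tokens):
    """Stand-in for the policy head: conditions the next action on the
    goal, the current observation, and the behavior tokens."""
    return goal + observation + sum(behavior_tokens)


# Rollout: the history grows every step, but the policy input stays
# fixed-size because the encoder re-compresses it each timestep.
encoder = BehaviorEncoder(num_tokens=4)
history = []
goal = 1.0
for step in range(10):
    obs = float(step)
    tokens = encoder.encode(history)
    assert len(tokens) == encoder.num_tokens
    action = predict_action(goal, obs, tokens)
    history.append((obs, action))
```

The point of the sketch is the interface, not the arithmetic: however long the episode runs, `encode` always emits `num_tokens` values, which is what lets a fixed-context policy consume long-horizon behavior.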
Problem

Research questions and friction points this paper is trying to address.

Model human behavior in open-world tasks
Integrate observations, actions, and language effectively
Enhance Minecraft agent performance via MLLM and GOAP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Goal-Observation-Action Conditioned Policy
Minecraft Goal-Observation-Action dataset