ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

📅 2024-10-23
📈 Citations: 1
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) struggle to communicate fine-grained spatial information and to bridge low-level visual observations with the abstract concepts needed for planning in open-world embodied decision-making. Method: The paper proposes visual-temporal context prompting, a communication protocol that replaces conventional language-based subtask decomposition by using object segmentation masks from past observations, tracked in real time by SAM-2, as spatial-semantic intermediaries between the VLM and the low-level policy. The policy model, ROCKET-1, is trained end-to-end to predict actions from concatenated visual observations and segmentation masks. Results: In the open-ended Minecraft environment, the approach achieves a 76% absolute improvement in open-world interaction performance and completes complex embodied tasks requiring fine-grained spatial reasoning that were previously unattainable.

📝 Abstract
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open-world interaction performance. Codes and demos are now available on the project page: https://craftjarvis.github.io/ROCKET-1.
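The abstract says ROCKET-1 predicts actions from concatenated visual observations and segmentation masks. A minimal sketch of how such a policy input might be assembled, assuming the mask is appended as an extra channel per frame (the function name and array shapes are illustrative assumptions, not details from the paper):

```python
import numpy as np

def build_policy_input(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Concatenate each RGB frame with its object segmentation mask.

    frames: (T, H, W, 3) float array of past observations
    masks:  (T, H, W) binary array highlighting the target object
            (e.g., produced by an object tracker such as SAM-2)

    Returns a (T, H, W, 4) array: the visual-temporal context the
    policy would condition on instead of a language instruction.
    """
    masks = masks[..., None].astype(frames.dtype)  # add channel axis
    return np.concatenate([frames, masks], axis=-1)
```

A real implementation would feed this stack into a sequence model over the T frames; the sketch only shows the channel-level fusion of observation and mask.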
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to open-world decision-making
Bridging low-level observations with abstract planning concepts
Enhancing spatial reasoning in complex task execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-temporal context prompting for VLMs
Object segmentation guides policy interactions
Real-time object tracking with SAM-2