Open-vocabulary Pick and Place via Patch-level Semantic Maps

📅 2024-06-21
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of robust generalization and few-shot adaptation for natural language–driven robots in unseen environments. We propose Grounded Equivariant Manipulation (GEM), the first framework to jointly leverage open-vocabulary understanding from vision-language models (VLMs) and geometric equivariance modeling. GEM constructs patch-level semantic maps and enables end-to-end semantic-to-action mapping via spatial equivariance constraints and cross-modal grounding—without fine-tuning large foundation models or requiring large-scale manipulation datasets. On multi-task benchmarks, GEM achieves a 37% improvement in zero-shot transfer accuracy over prior state-of-the-art methods. In real-robot experiments, it successfully executes over 120 unseen instructions, with an average of only two demonstrations per task, demonstrating strong generalization and practical deployability.

Technology Category

Application Category

📝 Abstract
Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and struggle with generalization. This paper introduces Grounded Equivariant Manipulation (GEM), a novel approach that leverages the generative capabilities of pre-trained vision-language models and geometric symmetries to facilitate few-shot and zero-shot learning for open-vocabulary robot manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and superior generalization across diverse pick-and-place tasks in both simulation and real-world experiments, showcasing its ability to adapt to novel instructions and unseen objects with minimal data requirements. GEM advances a significant step forward in the domain of language-conditioned robot control, bridging the gap between semantic understanding and action generation in robotic systems.
Problem

Research questions and friction points this paper is trying to address.

Improves robustness in language-conditioned robot manipulation tasks
Enhances efficiency with fewer robot data requirements
Addresses fragility in unseen scenarios and new tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pretrained vision-language models
Uses equivariant language mapping
High sample efficiency and generalization
🔎 Similar Papers
No similar papers found.