🤖 AI Summary
This work addresses the challenge of robust generalization and few-shot adaptation for natural language–driven robots in unseen environments. We propose Grounded Equivariant Manipulation (GEM), the first framework to jointly leverage open-vocabulary understanding from vision-language models (VLMs) and geometric equivariance modeling. GEM constructs patch-level semantic maps and enables end-to-end semantic-to-action mapping via spatial equivariance constraints and cross-modal grounding—without fine-tuning large foundation models or requiring large-scale manipulation datasets. On multi-task benchmarks, GEM achieves a 37% improvement in zero-shot transfer accuracy over prior state-of-the-art methods. In real-robot experiments, it successfully executes over 120 unseen instructions, with an average of only two demonstrations per task, demonstrating strong generalization and practical deployability.
📝 Abstract
Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and struggle with generalization. This paper introduces Grounded Equivariant Manipulation (GEM), a novel approach that leverages the generative capabilities of pre-trained vision-language models and geometric symmetries to facilitate few-shot and zero-shot learning for open-vocabulary robot manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and superior generalization across diverse pick-and-place tasks in both simulation and real-world experiments, showcasing its ability to adapt to novel instructions and unseen objects with minimal data requirements. GEM advances a significant step forward in the domain of language-conditioned robot control, bridging the gap between semantic understanding and action generation in robotic systems.