🤖 AI Summary
For language-guided grasp-and-place tasks in cluttered scenes, existing methods rely on large-scale datasets, suffer from error propagation across sequential stages, and neglect action priors. This paper proposes the A² framework, the first to jointly model multimodal foundation priors by aligning unconditional action priors with 3D vision-language priors through a single learnable attention layer. It employs a policy shared between grasp and place actions to enhance their coordination, and introduces a multimodal policy adaptation mechanism. Trained on only few-shot demonstrations, A² retains zero-shot generalization, transferring to unseen objects and novel instructions. In both simulation and real-robot experiments, it significantly improves task success rates and execution efficiency, reducing the number of grasp and place actions by 23% and 19%, respectively.
📝 Abstract
We study the task of language-conditioned pick and place in clutter, where a robot must grasp a target object in open clutter and move it to a specified location. Some approaches learn end-to-end policies on features from vision foundation models and therefore require large datasets; others compose foundation models in a zero-shot setting and suffer from cascading errors. Moreover, these methods primarily leverage vision and language foundation models while paying little attention to action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning a single attention layer. This alignment formulation enables our policy to train with less data while preserving zero-shot generalization capabilities. We show that a policy shared between pick and place actions improves performance on each task, and we introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place in clutter, generalizing effectively to unseen objects and language instructions.
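To make the alignment idea concrete, below is a minimal numpy sketch of a single cross-attention layer in which unconditioned action-prior features (queries) attend to 3D vision-language features (keys/values), producing language-conditioned action features. All names, feature shapes, and the single-head formulation are illustrative assumptions, not the paper's actual implementation; the learned weights `Wq`, `Wk`, `Wv` stand in for the one trainable attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_action_prior(action_feats, vl_feats, Wq, Wk, Wv):
    """Hypothetical single-layer cross-attention alignment.

    action_feats: (N_a, d) unconditioned action-prior features (assumed shape)
    vl_feats:     (N_v, d) 3D vision-language features (assumed shape)
    Wq, Wk, Wv:   (d, d)   the learnable projections of the one attention layer
    Returns:      (N_a, d) action features conditioned on the vision-language prior
    """
    Q = action_feats @ Wq          # project action priors to queries
    K = vl_feats @ Wk              # project vision-language features to keys
    V = vl_feats @ Wv              # ... and to values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product attention
    attn = softmax(scores, axis=-1)           # each action attends over VL features
    return attn @ V

# Toy usage with random features and untrained weights.
rng = np.random.default_rng(0)
d = 8
action_feats = rng.normal(size=(5, d))   # e.g. 5 candidate grasp/place actions
vl_feats = rng.normal(size=(12, d))      # e.g. 12 3D vision-language tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
aligned = align_action_prior(action_feats, vl_feats, Wq, Wk, Wv)
print(aligned.shape)  # (5, 8)
```

In this formulation the vision, language, and action foundation models stay frozen; only the small attention layer is trained, which is consistent with the paper's claim of learning from few demonstrations while preserving zero-shot generalization.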