Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
For language-conditioned pick-and-place tasks in cluttered scenes, existing methods rely on large-scale datasets, suffer from error propagation across sequential stages, and neglect action priors. This paper proposes the A² framework, the first to jointly model multimodal action priors, aligning unconditioned action priors with 3D vision-language priors via a single learnable attention layer. It employs a shared policy for grasp and place actions to enhance their coordination and introduces a multimodal policy adaptation mechanism. A² achieves zero-shot generalization from only few-shot demonstrations, transferring to unseen objects and novel instructions. In both simulation and real-robot experiments, it significantly improves task success rate and execution efficiency, reducing the number of grasp and place actions by 23% and 19%, respectively.

📝 Abstract
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real-world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions.
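The core alignment idea, learning a single attention layer that conditions unconditioned action candidates on 3D vision-language features, can be sketched as a cross-attention block. This is an illustrative reconstruction, not the paper's implementation: the class name, feature dimensions, and random initialization are all assumptions, and a real policy would train the projection matrices end to end.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PriorAlignmentAttention:
    """Hypothetical single cross-attention layer: unconditioned action-prior
    features act as queries; 3D vision-language features supply keys/values.
    The output is a language-conditioned feature per action candidate."""

    def __init__(self, d_action, d_vl, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable projections; randomly initialized here for illustration.
        self.Wq = rng.normal(scale=0.02, size=(d_action, d_model))
        self.Wk = rng.normal(scale=0.02, size=(d_vl, d_model))
        self.Wv = rng.normal(scale=0.02, size=(d_vl, d_model))

    def __call__(self, action_feats, vl_feats):
        # action_feats: (N_actions, d_action) unconditioned grasp/place candidates
        # vl_feats:     (N_points, d_vl) per-point 3D vision-language features
        Q = action_feats @ self.Wq
        K = vl_feats @ self.Wk
        V = vl_feats @ self.Wv
        # Scaled dot-product attention over the scene's vision-language features.
        attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        return attn @ V  # (N_actions, d_model) conditioned action features
```

Because only this one layer is learned on top of frozen foundation priors, the trainable parameter count stays small, which is consistent with the paper's claim of training with less data while preserving zero-shot generalization.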
Problem

Research questions and friction points this paper is trying to address.

Aligning unconditioned action priors with vision-language priors
Reducing data requirements for training robot policies
Enhancing pick and place task performance in cluttered environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns action priors with vision-language priors
Uses one attention layer for efficient training
Enhances performance with shared policy adaptation
👥 Authors
Kechun Xu
Zhejiang University, Hangzhou, China, and Alibaba Cloud, Hangzhou, China
Xunlong Xia
Alibaba Cloud, Hangzhou, China
Kaixuan Wang
Zhejiang University, Hangzhou, China
Yifei Yang
Shanghai Jiao Tong University
Yunxuan Mao
Zhejiang University
Bing Deng
Alibaba Cloud, Hangzhou, China
Rong Xiong
Zhejiang University
Yue Wang
Zhejiang University, Hangzhou, China