🤖 AI Summary
To address insufficient cross-modal alignment between visual observations and textual instructions in language-guided robotic manipulation, this paper proposes an end-to-end framework based on joint vision-language representation learning. The method introduces two self-supervised pretraining objectives: (i) masked target image reconstruction—enabling implicit learning of vision–language–action mappings without action annotations; and (ii) conditional image generation—to enhance cross-modal alignment. Trained on the newly introduced Omni-Object Pick-and-Place dataset, the framework is evaluated across five simulation and real-robot benchmarks. Results demonstrate that it achieves superior performance with only a few demonstrations for fine-tuning, significantly outperforming existing two-stage approaches and mainstream end-to-end baselines in task accuracy, generalization to unseen objects/contexts, and sample efficiency.
📝 Abstract
Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the extit{Omni-Object Pick-and-Place} dataset, which consists of annotated robot tabletop manipulation episodes, including 180 object classes and 3,200 instances with corresponding textual instructions. This dataset enables the model to acquire diverse object priors and allows for a more comprehensive evaluation of its generalisation capability across object instances. Experimental results on the five benchmarks, including both simulated and real-robot validations, demonstrate that our method outperforms prior art.