🤖 AI Summary
This work addresses unsupervised vision–language–action joint modeling for robotic manipulation. The authors propose a cross-modal latent-space alignment method based on a multimodal variational autoencoder (VAE) that jointly learns shared latent representations for images, natural language instructions, and continuous action trajectories, without fine-tuning large pre-trained models. Their model-invariant, fully unsupervised training alternative removes the dependence on labeled data and task-specific adaptation, and improves performance in a simulated environment by up to 55%. Systematic evaluations examine robustness to object and robot position variability, the number of distractors, and task length. The key contribution is empirical evidence that a lightweight multimodal VAE can serve as a practical alternative to fine-tuned large language and vision models for learning robotic motion trajectories from vision and language, at substantially lower computational and annotation cost.
📝 Abstract
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced output. A more lightweight alternative is the use of multimodal Variational Autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation, although state-of-the-art models have so far been demonstrated mostly on image-image or image-text data. Here, we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges posed by individual tasks, such as object or robot position variability, the number of distractors, or task length. Our work thus also sheds light on the potential benefits and limitations of using current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.
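The abstract describes integrating image, language, and action features into a joint latent representation. One standard fusion scheme used in multimodal VAEs is a product of experts over per-modality Gaussian posteriors, where precisions add and means are precision-weighted. The NumPy sketch below illustrates only this fusion step, with dummy values and illustrative names; it is not the paper's actual implementation, whose exact fusion scheme may differ:

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Fuse per-modality Gaussian posteriors into one joint Gaussian.

    mus, logvars: arrays of shape (n_modalities, latent_dim).
    A standard-normal prior expert N(0, I) is included, as is common
    in product-of-experts multimodal VAEs.
    """
    mus = np.vstack([np.zeros_like(mus[:1]), mus])
    logvars = np.vstack([np.zeros_like(logvars[:1]), logvars])
    precision = np.exp(-logvars)              # 1 / sigma^2 for each expert
    joint_var = 1.0 / precision.sum(axis=0)   # precisions add under a product
    joint_mu = joint_var * (precision * mus).sum(axis=0)
    return joint_mu, np.log(joint_var)

# Illustrative fusion of image, language, and action posteriors (dummy values).
mus = np.array([[0.5, -1.0], [1.5, 0.0], [1.0, 2.0]])
logvars = np.zeros((3, 2))                    # unit variance for each modality
joint_mu, joint_logvar = product_of_experts(mus, logvars)
```

Because precisions add, a confident modality (small variance) dominates the joint mean, while an uninformative one contributes little; this is one reason product-of-experts fusion is popular for combining heterogeneous inputs such as images, text, and trajectories.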