Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

📅 2024-04-02
🏛️ IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses unsupervised vision-language-action joint modeling for robotic manipulation. The authors propose a cross-modal latent-space alignment method based on multimodal variational autoencoders (VAEs), which learn a shared latent representation of images, natural language instructions, and continuous action trajectories without fine-tuning large pre-trained models. The model-invariant, fully unsupervised training strategy removes the dependence on labeled data and task-specific adaptation. Evaluated in a simulated environment, the proposed training alternative improves performance by up to 55% over the baseline setup. Systematic evaluations further examine generalization to object and robot position variability, varying numbers of distractors, and longer task horizons. The key contribution is empirical evidence that lightweight multimodal VAEs can serve as a computationally cheaper, annotation-free alternative to fine-tuned large language and vision models for learning robotic motion trajectories from vision and language, together with an analysis of their current limitations.
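The summary hinges on fusing per-modality encodings into one shared latent space. Below is a minimal PyTorch sketch of that idea using Product-of-Experts (PoE) fusion in the style of Wu and Goodman's MVAE; the modality names, feature dimensions, and layer sizes are illustrative assumptions, and the paper does not commit to this exact architecture in the text above.

# Minimal sketch of a multimodal VAE with Product-of-Experts (PoE) fusion
# over image, language, and action modalities. All dimensions and the PoE
# choice are illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps one modality's features to Gaussian latent parameters."""
    def __init__(self, in_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def poe(mus, logvars):
    """Product of Gaussian experts, including a standard-normal prior expert."""
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    mus_all = [torch.zeros_like(mus[0])] + list(mus)
    total_prec = sum(precisions)
    mu = sum(m * p for m, p in zip(mus_all, precisions)) / total_prec
    return mu, torch.log(1.0 / total_prec)  # log-variance of the product

class MultimodalVAE(nn.Module):
    def __init__(self, dims, latent_dim=32):
        super().__init__()
        self.encoders = nn.ModuleDict({k: Encoder(d, latent_dim) for k, d in dims.items()})
        self.decoders = nn.ModuleDict(
            {k: nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, d))
             for k, d in dims.items()})

    def forward(self, inputs):
        # Encode only the modalities that are present, then fuse with PoE.
        mus, logvars = zip(*(self.encoders[k](v) for k, v in inputs.items()))
        mu, logvar = poe(mus, logvars)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        recons = {k: dec(z) for k, dec in self.decoders.items()}  # decode all modalities
        return recons, mu, logvar

# Example: image features, an instruction embedding, and a flattened trajectory.
model = MultimodalVAE({"image": 512, "language": 128, "action": 7 * 20})
batch = {"image": torch.randn(4, 512), "language": torch.randn(4, 128)}
recons, mu, logvar = model(batch)  # decode an action from vision + language only
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

Because the PoE posterior accepts any subset of modalities, the same model can encode vision and language at test time and decode an action trajectory, which is the cross-modal inference pattern evaluated in work of this kind.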

📝 Abstract
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced output. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here, we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the results obtained, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by individual tasks, such as object or robot position variability, number of distractors, or task length. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.
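For context, multimodal VAEs of the kind discussed here are typically trained by maximizing a multimodal evidence lower bound (ELBO). A standard form of this objective for the model family (not an equation quoted from this paper) is:

\mathcal{L}(x_{1:M}) = \mathbb{E}_{q_\phi(z \mid x_{1:M})}\!\left[ \sum_{m=1}^{M} \lambda_m \log p_{\theta_m}(x_m \mid z) \right] - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x_{1:M}) \,\|\, p(z) \right)

where x_1, \dots, x_M are the modalities (here: image, language instruction, action trajectory), z is the shared latent variable, \lambda_m and \beta are weighting terms, and the joint posterior q_\phi(z \mid x_{1:M}) is assembled from per-modality encoders, e.g. as a product or mixture of experts.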
Problem

Research questions and friction points this paper is trying to address.

Unsupervised vision-language-action mapping for robotic manipulation
Lightweight multimodal VAEs for latent feature integration
Model-invariant training to improve performance in simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal VAEs for unsupervised robotic manipulation
Model-invariant training boosts performance by up to 55%
Evaluates VAEs on vision-language-action integration challenges
G. Sejnova
Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic
M. Vavrecka
Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic
Karla Stepanova
Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Prague, Czech Republic