Cross-Hand Latent Representation for Vision-Language-Action Models

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models struggle to generalize across different dexterous robotic hands in manipulation tasks, and that collecting large-scale demonstration data for each new hand is prohibitively expensive. To overcome this, the authors propose XL-VLA, a framework that introduces a shared latent action representation across hand morphologies, tailored for dexterous manipulation, achieving embodiment invariance while integrating seamlessly into standard VLA architectures. By jointly modeling vision, language, and action through latent action encoding and cross-hand alignment training, XL-VLA unifies multimodal perception and linguistic instructions within a common latent action space. Experiments show that XL-VLA significantly outperforms baselines operating in raw joint spaces across multiple dexterous hand platforms, markedly improving cross-embodiment generalization and data efficiency.
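The summary names two mechanisms, latent action encoding and cross-hand alignment training, without giving implementation detail. As one concrete reading, the PyTorch sketch below pairs per-hand encoders and decoders around a single shared latent space and adds a simple alignment term; every module name, layer size, and the MSE alignment objective are our assumptions, not the paper's released method.

```python
# Minimal sketch (not the authors' code) of a shared latent action space
# with per-hand encoders/decoders. All names, layer sizes, and the latent
# dimension are illustrative assumptions.
import torch
import torch.nn as nn

class LatentActionCodec(nn.Module):
    """Per-hand encoder/decoder pairs around one shared latent action space.

    Each dexterous hand has its own joint dimensionality; only the latent
    space (and any policy operating in it) is shared across embodiments.
    """

    def __init__(self, joint_dims: dict[str, int], latent_dim: int = 32):
        super().__init__()
        self.encoders = nn.ModuleDict({
            hand: nn.Sequential(nn.Linear(d, 128), nn.GELU(),
                                nn.Linear(128, latent_dim))
            for hand, d in joint_dims.items()
        })
        self.decoders = nn.ModuleDict({
            hand: nn.Sequential(nn.Linear(latent_dim, 128), nn.GELU(),
                                nn.Linear(128, d))
            for hand, d in joint_dims.items()
        })

    def encode(self, hand: str, joints: torch.Tensor) -> torch.Tensor:
        return self.encoders[hand](joints)

    def decode(self, hand: str, latent: torch.Tensor) -> torch.Tensor:
        return self.decoders[hand](latent)

def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Pull latents of paired cross-hand demonstrations together (assumed
    pairing; the summary does not say how XL-VLA obtains aligned data)."""
    return torch.mean((z_a - z_b) ** 2)
```

Training such a codec would typically combine per-hand reconstruction of joint commands with the alignment term, so the latent stays decodable for each hand while remaining comparable across hands.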

📝 Abstract
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception (vision, sound, and language-guided intent) to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
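Since the abstract emphasizes that the latent space is "directly pluggable into standard VLA architectures", the sketch below shows one plausible wiring: a VLA backbone and action head predict embodiment-invariant latent actions, and only a light per-hand decoder maps them to joint targets. The backbone interface and all dimensions are hypothetical, not taken from the paper.

```python
# Illustrative sketch of how an embodiment-invariant latent action space
# plugs into a generic VLA policy. `backbone` stands in for any standard
# vision-language model; none of these names come from the paper.
import torch
import torch.nn as nn

class LatentVLA(nn.Module):
    """VLA policy that predicts latent actions, then decodes them per hand."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 joint_dims: dict[str, int], latent_dim: int = 32):
        super().__init__()
        self.backbone = backbone                          # vision + language -> features
        self.to_latent = nn.Linear(feat_dim, latent_dim)  # shared latent action head
        # Only these light decoders are hand-specific; everything above
        # them is reused unchanged when a new hand is added.
        self.decoders = nn.ModuleDict({
            hand: nn.Linear(latent_dim, d) for hand, d in joint_dims.items()
        })

    def forward(self, image: torch.Tensor, instruction: torch.Tensor,
                hand: str) -> torch.Tensor:
        feats = self.backbone(image, instruction)   # (B, feat_dim)
        latent_action = self.to_latent(feats)       # embodiment-invariant
        return self.decoders[hand](latent_action)   # hand-specific joint targets
```

Under this layout, supporting a new hand only requires fitting its small decoder, so demonstrations from existing hands keep training the shared backbone and head, which is consistent with the data-reuse claim above.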
Problem

Research questions and friction points this paper is trying to address.

dexterous manipulation
cross-embodiment learning
vision-language-action models
scalable robotics
latent representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-embodiment learning
latent action space
vision-language-action models
dexterous manipulation
embodiment-invariant representation