🤖 AI Summary
Existing vision-language-action (VLA) models rely exclusively on textual instructions, neglecting the more natural speech modality for human-robot interaction; conventional speech integration requires a cascaded automatic speech recognition (ASR) front end, which introduces error propagation and discards non-semantic cues such as speaker identity. This work proposes VLAS, an end-to-end speech-driven VLA model that maps raw speech directly to robot actions. The method introduces three key components: (1) an internal speech-text alignment mechanism, trained with a contrastive objective, that lets the policy model understand spoken commands without a separate ASR system; (2) a voice retrieval-augmented generation (RAG) paradigm that exploits voiceprint cues for speaker-aware, personalized task execution; and (3) two new datasets, SQA and CSI, supporting a three-stage speech-specific fine-tuning process. Experiments demonstrate that VLAS substantially improves task success rates across diverse spoken instructions, achieving unified multimodal interaction spanning text, images, speech, and robot actions.
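The summary mentions a contrastive objective for speech-text alignment. A minimal sketch of such an InfoNCE-style loss over a batch of paired speech and text embeddings is shown below; the function name, embedding shapes, and temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: matched speech/text pairs (same batch index)
    are pulled together; all other pairs in the batch are pushed apart."""
    # L2-normalize so the dot product is cosine similarity
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature  # (B, B) similarity matrix
    # Row-wise log-softmax; positive pairs lie on the diagonal
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned_speech = text + 0.01 * rng.normal(size=(4, 8))  # near-perfect pairs
random_speech = rng.normal(size=(4, 8))                 # unrelated pairs
loss_aligned = contrastive_alignment_loss(aligned_speech, text)
loss_random = contrastive_alignment_loss(random_speech, text)
```

Well-aligned speech embeddings yield a much lower loss than unrelated ones, which is the training signal that lets the policy model interpret speech in the same representation space as text.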
📝 Abstract
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure discards non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome these challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability to interact across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
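The voice RAG paradigm retrieves individual-specific knowledge keyed on the speaker's voiceprint. A minimal sketch under simple assumptions (cosine-similarity retrieval over stored speaker embeddings; the class, method names, and threshold are hypothetical, not the paper's implementation):

```python
import numpy as np

class VoiceRAGStore:
    """Toy voiceprint-keyed knowledge store: retrieve a speaker's
    profile by cosine similarity of speaker embeddings."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # below this, treat speaker as unknown
        self.voiceprints = []
        self.profiles = []

    def register(self, voiceprint, profile):
        v = np.asarray(voiceprint, dtype=float)
        self.voiceprints.append(v / np.linalg.norm(v))
        self.profiles.append(profile)

    def retrieve(self, query_voiceprint):
        q = np.asarray(query_voiceprint, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.array([v @ q for v in self.voiceprints])
        best = int(np.argmax(sims))
        # Unknown speaker: inject no personalized context
        return self.profiles[best] if sims[best] >= self.threshold else None

store = VoiceRAGStore()
store.register([1.0, 0.0, 0.0], "Alice: 'my cup' means the red mug")
store.register([0.0, 1.0, 0.0], "Bob: 'my cup' means the blue glass")
hit = store.retrieve([0.9, 0.1, 0.0])   # close to Alice's voiceprint
miss = store.retrieve([0.5, 0.5, 0.7])  # matches no registered speaker
```

The retrieved profile would then be provided as extra context alongside the spoken command, letting the model resolve speaker-specific references such as "my cup".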