VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language-action (VLA) models rely exclusively on textual instructions, neglecting the more natural speech modality for human-robot interaction; conventional speech integration requires a cascaded automatic speech recognition (ASR) system, which introduces error propagation and discards non-semantic cues such as speaker identity. This work proposes VLAS, an end-to-end speech-driven VLA model that maps raw speech directly to robot actions. The method introduces three key components: (1) an internal speech-text alignment mechanism, leveraging contrastive learning, that grounds spoken instructions in the policy model without a separate ASR stage; (2) a voice retrieval-augmented generation (RAG) paradigm that uses speaker identity to enable personalized task execution; and (3) two new datasets, SQA and CSI, that support a three-stage speech-specific tuning process. Experiments demonstrate that VLAS significantly improves task success rates across diverse spoken instructions, achieving unified multimodal interaction spanning text, images, speech, and robot actions.
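
The summary attributes the speech-text alignment to contrastive learning. As an illustration only (not the paper's released code), a symmetric InfoNCE objective over paired speech and text embeddings would look roughly like the sketch below; the pooled speech_emb/text_emb inputs and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def speech_text_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired speech/text embeddings.

    speech_emb, text_emb: (batch, dim) pooled vectors from a (hypothetical)
    speech encoder and text encoder; row i of each describes the same instruction.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; off-diagonal entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```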

📝 Abstract
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, transcription discards non-semantic information in the raw speech, such as the voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which equips VLAS with multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
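
The voice RAG idea from the abstract can be pictured as a small speaker-conditioned lookup: a speaker embedding extracted from the raw speech command retrieves that person's stored preferences, which then condition the policy prompt. The sketch below is a hedged illustration under assumed names (embed_speaker, enroll, the cosine threshold), not the paper's implementation.

```python
import numpy as np

class VoiceRAG:
    """Toy sketch of a voice retrieval-augmented generation step.

    All names and the retrieval scheme here are illustrative assumptions,
    not the paper's API.
    """

    def __init__(self, embed_speaker):
        self.embed_speaker = embed_speaker   # callable: waveform -> unit-norm speaker vector
        self.profiles = []                   # list of (speaker_vec, knowledge_str)

    def enroll(self, waveform, knowledge):
        # Register a speaker with their individual-specific knowledge.
        self.profiles.append((self.embed_speaker(waveform), knowledge))

    def retrieve(self, waveform, threshold=0.7):
        # Match the command's voiceprint against enrolled speakers by cosine similarity.
        query = self.embed_speaker(waveform)
        best = max(self.profiles, key=lambda p: float(np.dot(query, p[0])), default=None)
        if best is not None and float(np.dot(query, best[0])) >= threshold:
            return best[1]                   # e.g. "Alice's cup is the blue mug"
        return ""                            # unknown speaker: no personalized context

    def build_prompt(self, waveform, instruction_text):
        # Retrieved knowledge is prepended so the policy can resolve references like "my cup".
        return self.retrieve(waveform) + "\n" + instruction_text
```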
Problem

Research questions and friction points this paper is trying to address.

Existing VLAs accept only text instructions, overlooking speech as a more natural modality for human-robot interaction.
Cascaded speech recognition pipelines complicate the system and propagate transcription errors into the policy.
Transcription discards non-semantic cues such as the voiceprint, which customized tasks may depend on.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end integration of speech understanding into the robot policy model, removing the separate ASR stage
Multimodal interaction across text, image, speech, and robot actions, enabled by a three-stage tuning process on the new SQA and CSI datasets
Voice retrieval-augmented generation (RAG) paradigm for tasks requiring individual-specific knowledge