VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Vision-Language-Action (VLA) models employ rigid interaction paradigms ill-suited for concurrent multimodal processing—integrating vision, audio, language, and motor control—and lack dynamic responsiveness to real-time user interruptions, thereby limiting the naturalness and flexibility of embodied collaboration. To address this, we propose a dual-VLA parallel architecture coupled with a “model-as-controller” paradigm, wherein an active model and a standby model coordinate to enable end-to-end generation of system-level action commands from multimodal sensory inputs (vision, speech, environment), while supporting near-real-time interruption handling. The framework integrates joint fine-tuning of automatic speech recognition, environmental perception, motor execution, and large language modeling. Evaluated on a physical humanoid robot, it achieves >95% success rates in emergency stop and voice-based interruption responses, enables stable concurrent speech and action execution, and significantly improves response latency and interaction naturalness in human-robot collaboration.

📝 Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
Problem

Research questions and friction points this paper is trying to address.

Enabling concurrent seeing, hearing, speaking, and acting
Handling real-time user interruptions dynamically
Overcoming rigid, static interaction paradigms in VLA models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-model architecture enables concurrent interaction capabilities
Model-as-controller paradigm generates system commands via tokens
Fine-tuned VLM handles real-time interruptions and multitasking
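To make the "model-as-controller" idea concrete, the sketch below shows one plausible dispatch loop: the VLM's decoded output stream is scanned for special control tokens, which are mapped directly to system-level commands (e.g. emergency stop, promoting the standby model), while ordinary tokens pass through as speech. All token names and handlers here are illustrative assumptions, not the paper's actual token vocabulary.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Controller:
    # Maps a special token string to a system-level command handler.
    handlers: dict[str, Callable[[], str]] = field(default_factory=dict)
    # Record of executed system commands, for inspection.
    log: list[str] = field(default_factory=list)

    def register(self, token: str, fn: Callable[[], str]) -> None:
        self.handlers[token] = fn

    def dispatch(self, stream: list[str]) -> str:
        """Scan a decoded token stream: special tokens trigger system
        commands; ordinary tokens are accumulated as speech output."""
        speech = []
        for tok in stream:
            if tok in self.handlers:
                self.log.append(self.handlers[tok]())
            else:
                speech.append(tok)
        return " ".join(speech)

# Hypothetical control tokens coupling model output to system behavior.
ctrl = Controller()
ctrl.register("<|stop_action|>", lambda: "halt motor execution")
ctrl.register("<|swap_model|>", lambda: "promote standby model to active")

reply = ctrl.dispatch(["Sure,", "stopping", "now.", "<|stop_action|>"])
```

Because the commands are emitted inline with the model's generation, an interruption response (here, the stop command) is coupled to the model's verbal acknowledgment rather than handled by a separate rule-based layer.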
Authors
Xiaoyu Liu — Nanjing University
Chaoyou Fu — Nanjing University
Chi Yan — Tencent Youtu Lab
Chu Wu — Nanjing University
Haihan Gao — Tencent Youtu Lab
Yi-Fan Zhang — Institute of Automation, Chinese Academy of Sciences
Shaoqi Dong — Nanjing University
Cheng Qian — Fourier Intelligence Inc.
Bin Luo — Fourier Intelligence Inc.
Xiuyong Yang — Fourier Intelligence Inc.
Guanwu Li — Fourier Intelligence Inc.
Yusheng Cai — Fourier Intelligence Inc.
Yunhang Shen — Tencent Youtu Lab
Deqiang Jiang — Tencent Youtu Lab
Haoyu Cao — Tencent Youtu Lab
Xing Sun — Tencent Youtu Lab
Caifeng Shan — Philips Research
Ran He — CASIA