End-to-end Listen, Look, Speak and Act

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling human-like multimodal interaction—such as turn-taking, concurrent speech and gesture, visual question answering, and action interruption—in a fully duplex, end-to-end manner. We propose the first unified model that jointly processes auditory, visual, linguistic, and motor modalities. Methodologically, we introduce SA-MoE, a Self-Attention Mixture-of-Experts architecture in which a shared self-attention backbone fuses the token stream while modality-specific tokens are dynamically routed to dedicated expert modules; pretrained unimodal components are integrated to support fine-grained cross-modal coordination and context-aware concurrent generation. Evaluated on speech-interaction and robotic-manipulation benchmarks, our model matches state-of-the-art unimodal methods on modality-specific tasks and significantly improves coherence and response naturalness in complex interactive scenarios. This work represents a substantive step toward general interactive intelligence, shifting the paradigm from unidirectional perception to bidirectional, synergistic multimodal interaction.
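To make the routing idea concrete, below is a minimal sketch of an SA-MoE layer, assuming hard routing by a per-token modality id. The module names, sizes, and gating rule are illustrative assumptions, not the released implementation: every token attends over the full multimodal sequence through one shared self-attention, and the feed-forward step is dispatched to a per-modality expert.

```python
# Minimal sketch of a Self-Attention Mixture-of-Experts (SA-MoE) layer.
# All names, sizes, and the hard modality-id routing below are assumptions for
# illustration; the paper's actual implementation may differ.
import torch
import torch.nn as nn

class SAMoELayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8,
                 modalities=("audio", "vision", "text", "action")):
        super().__init__()
        # Shared self-attention backbone: all modalities attend over one joint token stream.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per modality (in the real system these could be
        # initialized from pretrained unimodal components).
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model),
                             nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in modalities
        })
        self.modalities = list(modalities)

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) ints indexing self.modalities.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)      # joint cross-modal context
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for i, m in enumerate(self.modalities):
            mask = modality_ids == i          # hard routing: each token goes to its modality's expert
            if mask.any():
                out[mask] = self.experts[m](h[mask])
        return x + out
```

The per-modality experts are plain feed-forward blocks here for brevity; per the summary, the actual system leverages strong pretrained unimodal components for each modality.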

📝 Abstract
Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Computationally simulating human multimodal, full-duplex interaction capabilities
Enabling simultaneous perception and generation across vision, text, speech, and action
Developing a unified architecture for natural turn-taking and interruption behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

First full-duplex end-to-end multimodal interaction model
Novel SA-MoE architecture routes modalities to experts
Enables simultaneous perception and generation across modalities
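As a rough illustration of the last point, a single forward pass over an interleaved stream could look like this, reusing the hypothetical SAMoELayer sketched above; the shapes and the 0-3 modality coding are arbitrary assumptions, not the paper's token schedule.

```python
# Toy usage of the hypothetical SAMoELayer from the sketch above; the interleaving
# pattern is illustrative only.
import torch

layer = SAMoELayer(d_model=512)
x = torch.randn(1, 6, 512)                        # 6 tokens of mixed modalities
# 0=audio, 1=vision, 2=text, 3=action: perceived audio/vision tokens interleaved
# with generated speech-text and action tokens in one full-duplex sequence.
modality_ids = torch.tensor([[0, 1, 2, 0, 2, 3]])
y = layer(x, modality_ids)                        # (1, 6, 512) fused representations
```

Because perceived and generated tokens share one attention context, the model can keep listening and looking while it speaks and acts, which is the behavior the full-duplex claim refers to.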
Siyin Wang
Tsinghua University
Wenyi Yu
Tsinghua University
Xianzhao Chen
ByteDance
Xiaohai Tian
National University of Singapore (NUS)
Voice conversion · Text-to-speech · Anti-spoofing
Jun Zhang
ByteDance
Lu Lu
ByteDance
Chao Zhang
Tsinghua University