🤖 AI Summary
This work addresses the challenge of enabling human-like multimodal interaction (turn-taking, concurrent speech and gesture, visual question answering, and action interruption) in a fully duplex, end-to-end manner. We propose what is, to our knowledge, the first unified model to jointly process auditory, visual, linguistic, and motor modalities. Methodologically, we introduce SA-MoE, a Self-Attention Mixture-of-Experts architecture in which a shared self-attention backbone fuses the multimodal sequence while modality-specific tokens are dynamically routed to dedicated expert modules; pretrained unimodal components are integrated to support fine-grained cross-modal coordination and context-aware concurrent generation. Evaluated on speech-interaction and robotic-manipulation benchmarks, our model matches state-of-the-art unimodal methods on modality-specific tasks and significantly improves coherence and response naturalness in complex interactive scenarios. This work represents a substantive advance toward general interactive intelligence, shifting the paradigm from unidirectional perception to bidirectional, synergistic multimodal interaction.
📝 Abstract
Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models that interact as humans do. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach and yielding more natural, human-like behaviors. At its core is a novel Self-Attention Mixture-of-Experts (SA-MoE) architecture that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective-instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code, and model checkpoints will be released upon acceptance.
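To make the SA-MoE idea concrete, the following is a minimal PyTorch sketch of one such block: a self-attention layer shared across all modalities, followed by hard routing of each token to a per-modality feed-forward expert. This is an illustrative assumption of the design, not the paper's released implementation; the class name, dimensions, and routing scheme are hypothetical.

```python
# Hypothetical sketch of a Self-Attention Mixture-of-Experts (SA-MoE) block.
# Assumes hard routing: each token goes to the expert for its modality.
# All names and sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class SAMoEBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4,
                 modalities=("vision", "text", "speech", "action")):
        super().__init__()
        self.modalities = modalities
        # shared self-attention backbone fusing the interleaved multimodal sequence
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # one dedicated feed-forward expert per modality
        self.experts = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in modalities
        })

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer tags
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # all modalities attend jointly
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(x)
        # route each token to its modality's expert (hard, token-level routing)
        for i, m in enumerate(self.modalities):
            mask = modality_ids == i
            if mask.any():
                out[mask] = self.experts[m](h[mask])
        return x + out

block = SAMoEBlock()
x = torch.randn(2, 10, 64)                       # interleaved multimodal tokens
ids = torch.randint(0, 4, (2, 10))               # which modality each token is
y = block(x, ids)
```

The shared attention lets every modality condition on every other (e.g. speech tokens attending to visual context), while the per-modality experts keep modality-specific processing separate, which is one plausible way to mitigate the modality interference the abstract mentions.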