🤖 AI Summary
Current embodied intelligence faces two key bottlenecks: (1) ineffective fusion of multimodal inputs (three or more modalities), and (2) inability to support human-in-the-loop, full-duplex, real-time interaction. This paper introduces the first native full-duplex multimodal embodied intelligence model, jointly processing visual, speech, and textual modalities while departing from conventional half-duplex paradigms. Our core contributions are: (1) a unified multimodal fusion backbone; (2) a streaming vision-grounded dialogue mechanism enabling real-time visual grounding during ongoing speech; and (3) a low-latency full-duplex scheduling algorithm achieving a theoretical end-to-end latency of 80 ms. Evaluated on realistic streaming audio-visual dialogue scenarios, our model achieves a 3.2× improvement in response speed over prior work, attains a Mean Opinion Score (MOS) of 4.1 for speech naturalness, and matches state-of-the-art half-duplex models in content quality, demonstrating for the first time the simultaneous realization of high responsiveness, high naturalness, and high fidelity.
📝 Abstract
Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling three or more modalities, such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming, visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness while maintaining content quality comparable to state-of-the-art semi-duplex omnimodal models, a feat previously considered unattainable by native full-duplex systems.
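The 80 ms theoretical duplex latency can be read as a per-frame scheduling budget: if input and output streams are interleaved at the frame level, the first response frame can follow the first input frame rather than the end of the utterance. The sketch below illustrates this contrast in the abstract sense only; it is not the paper's actual algorithm, and the 80 ms frame size and the latency model are assumptions for illustration.

```python
# Illustrative sketch (not RoboEgo's actual scheduler): contrast the
# best-case response latency of half-duplex turn-taking with
# frame-interleaved full-duplex output.

FRAME_MS = 80  # assumed duplex frame size, taken from the reported 80 ms figure


def half_duplex_latency(user_frames: int) -> int:
    """Half-duplex: the model must consume the whole user utterance
    before emitting its first response frame."""
    return user_frames * FRAME_MS + FRAME_MS


def full_duplex_latency(user_frames: int) -> int:
    """Full-duplex: input and output are interleaved per frame, so the
    theoretical floor is a single frame regardless of utterance length."""
    return FRAME_MS


if __name__ == "__main__":
    frames = 25  # a 2-second user utterance at 80 ms per frame
    print(half_duplex_latency(frames))  # 2080 ms: reply waits for the turn to end
    print(full_duplex_latency(frames))  # 80 ms: reply can start after one frame
```

Under this toy model, the full-duplex floor is independent of utterance length, which is why frame-interleaved scheduling, rather than faster per-token inference alone, is what makes the latency gap grow with longer user turns.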