🤖 AI Summary
Current embodied intelligence faces two key bottlenecks: (1) ineffective fusion of multimodal inputs (three or more modalities), and (2) inability to support human-in-the-loop, full-duplex, real-time interaction. This paper introduces the first native full-duplex multimodal embodied intelligence model, jointly processing visual, speech, and textual modalities while departing from conventional half-duplex paradigms. Our core contributions are: (1) a unified multimodal fusion backbone; (2) a streaming vision-grounded dialogue mechanism enabling real-time visual grounding during ongoing speech; and (3) a low-latency full-duplex scheduling algorithm achieving a theoretical end-to-end latency of 80 ms. Evaluated on realistic streaming audio-visual dialogue scenarios, our model achieves a 3.2× improvement in response speed over prior work, attains a Mean Opinion Score (MOS) of 4.1 for speech naturalness, and matches state-of-the-art half-duplex models in content quality, demonstrating for the first time the simultaneous realization of high responsiveness, high naturalness, and high fidelity.
📝 Abstract
Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling three or more modalities, such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming, visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness while maintaining content quality comparable to state-of-the-art semi-duplex omnimodal models, a feat previously considered unattainable by native full-duplex systems.
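The 80 ms theoretical duplex latency can be read as a per-frame scheduling budget: if input and output streams are interleaved at the frame level, the first response frame can follow the first input frame rather than the end of the utterance. The sketch below illustrates this contrast in the abstract sense only; it is not the paper's actual algorithm, and the 80 ms frame size and the latency model are assumptions for illustration.

```python
# Illustrative sketch (not RoboEgo's actual scheduler): contrast the
# best-case response latency of half-duplex turn-taking with
# frame-interleaved full-duplex output.

FRAME_MS = 80  # assumed duplex frame size, taken from the reported 80 ms figure


def half_duplex_latency(user_frames: int) -> int:
    """Half-duplex: the model must consume the whole user utterance
    before emitting its first response frame."""
    return user_frames * FRAME_MS + FRAME_MS


def full_duplex_latency(user_frames: int) -> int:
    """Full-duplex: input and output are interleaved per frame, so the
    theoretical floor is a single frame regardless of utterance length."""
    return FRAME_MS


if __name__ == "__main__":
    frames = 25  # a 2-second user utterance at 80 ms per frame
    print(half_duplex_latency(frames))  # 2080 ms: reply waits for the turn to end
    print(full_duplex_latency(frames))  # 80 ms: reply can start after one frame
```

Under this toy model, the full-duplex floor is independent of utterance length, which is why frame-interleaved scheduling, rather than faster per-token inference alone, is what makes the latency gap grow with longer user turns.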