RoboEgo System Card: An Omnimodal Model with Native Full Duplexity

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current embodied intelligence faces two key bottlenecks: (1) ineffective fusion of multimodal inputs (≥3 modalities), and (2) inability to support human-in-the-loop, full-duplex real-time interaction. This paper introduces the first native full-duplex multimodal embodied intelligence model, jointly processing visual, speech, and textual modalities while departing from conventional half-duplex paradigms. Our core contributions are: (1) a unified multimodal fusion backbone; (2) a streaming vision-grounded dialogue mechanism enabling real-time visual grounding during ongoing speech; and (3) a low-latency full-duplex scheduling algorithm achieving a theoretical end-to-end latency of 80 ms. Evaluated on realistic streaming audio-visual dialogue scenarios, our model achieves a 3.2× improvement in response speed over prior work, attains a Mean Opinion Score (MOS) of 4.1 for speech naturalness, and matches state-of-the-art half-duplex models in content quality—demonstrating, for the first time, the simultaneous realization of high responsiveness, high naturalness, and high fidelity.

Technology Category

Application Category

📝 Abstract
Humans naturally process real-world multimodal information in a full-duplex manner. In artificial intelligence, replicating this capability is essential for advancing model development and deployment, particularly in embodied contexts. The development of multimodal models faces two primary challenges: (1) effectively handling more than three modalities-such as vision, audio, and text; and (2) delivering full-duplex responses to rapidly evolving human instructions. To facilitate research on models that support both omnimodal processing and full duplexity, we present RoboEgo (alias: FLM-Ego), a unified model system designed to address both challenges. RoboEgo incorporates a backbone architecture and algorithms that natively support full duplexity, achieving a theoretical duplex latency of 80 ms. In streaming visually grounded conversations under real-world conditions, RoboEgo exhibits superior responsiveness and speech naturalness, while maintaining comparable content qualities to state-of-the-art semi-duplex omnimodal models-a feat previously considered unattainable by native full-duplex systems.
Problem

Research questions and friction points this paper is trying to address.

Handling more than three modalities effectively
Achieving full-duplex responses to human instructions
Maintaining content quality in real-world conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model system for omnimodal processing
Native full duplexity with 80 ms latency
Superior responsiveness and speech naturalness
🔎 Similar Papers
No similar papers found.
Yiqun Yao
Yiqun Yao
Unknown affiliation
X
Xiang Li
Beijing Academy of Artificial Intelligence, Beijing, China
X
Xin Jiang
Beijing Academy of Artificial Intelligence, Beijing, China
X
Xuezhi Fang
Beijing Academy of Artificial Intelligence, Beijing, China
Naitong Yu
Naitong Yu
Beijing Academy of Artificial Intelligence
Large Language ModelsNatural Language ProcessingArtificial Intelligence
A
Aixin Sun
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Y
Yequan Wang
Beijing Academy of Artificial Intelligence, Beijing, China