ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

πŸ“… 2026-01-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing large multimodal models in streaming audio-visual understanding, which often suffer from incomplete modality support or a lack of proactive monitoring capabilities. To overcome these challenges, the authors propose ROMA (Real-time Omni-Multimodal Assistant), a framework that decouples response triggering from content generation via a lightweight β€œspeak head,” enabling unified proactive and reactive interaction. ROMA further processes continuous inputs as synchronized multimodal units that align dense audio features with discrete video frames, and is trained on a curated streaming dataset through a two-stage curriculum strategy. Evaluated across 12 benchmarks, ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive on reactive ones, demonstrating its effectiveness as a unified and efficient solution for real-time multimodal understanding.
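The summary describes the speak head only at a high level, but the decoupling idea is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of separating *when* to respond from *what* to say: a small binary head reads the backbone's per-step hidden state, and the expensive decoding pass runs only when the trigger fires. All names and values here (SpeakHead, the 4096-dim state, the 0.5 threshold) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakHead(nn.Module):
    """Lightweight binary trigger over the backbone's latest hidden state.

    Hypothetical sketch: the paper's actual head architecture is not
    specified in this summary, so this MLP is an assumption.
    """
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 8),
            nn.GELU(),
            nn.Linear(hidden_dim // 8, 1),  # single logit: "speak now?"
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) for the current streaming unit
        return torch.sigmoid(self.proj(hidden_state)).squeeze(-1)

# Usage on a dummy hidden state standing in for the backbone's output.
# Silent monitoring steps cost one small forward pass, not a full decode,
# which is the efficiency argument for decoupling triggering from generation.
head = SpeakHead(hidden_dim=4096)
state = torch.randn(1, 4096)
if head(state).item() >= 0.5:  # assumed trigger threshold
    print("trigger: hand off to the decoder for a response")
else:
    print("silent: keep monitoring the stream")
```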

πŸ“ Abstract
Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming-format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
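To make the granularity mismatch concrete: audio features arrive far more densely than sampled video frames, so a synchronized unit can bucket the audio that temporally overlaps each frame. The sketch below illustrates that bookkeeping only; the rates (1 video fps, 25 audio feature frames per second) and all names (MultimodalUnit, build_units) are assumptions for illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class MultimodalUnit:
    t: float             # unit timestamp in seconds
    video_frame: object  # one sampled frame (e.g., an image tensor)
    audio_frames: list   # dense audio features falling in [t, t + 1/fps)

def build_units(video_frames, audio_frames, video_fps=1.0, audio_fps=25.0):
    """Group dense audio features under the video frame they overlap."""
    per_unit = int(audio_fps / video_fps)  # audio frames per video frame
    units = []
    for i, frame in enumerate(video_frames):
        start = i * per_unit
        units.append(MultimodalUnit(
            t=i / video_fps,
            video_frame=frame,
            audio_frames=audio_frames[start:start + per_unit],
        ))
    return units

# Example: 3 s of stream at 1 video fps and 25 audio feature frames/s.
units = build_units(video_frames=["f0", "f1", "f2"],
                    audio_frames=[f"a{k}" for k in range(75)])
print(len(units), len(units[0].audio_frames))  # -> 3 25
```

Each unit then becomes one streaming step for the model, which is what lets a per-step decision head (such as the speak head above) operate at a single, consistent time granularity.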
Problem

Research questions and friction points this paper is trying to address.

streaming audio-video understanding
omni-multimodal
proactive monitoring
real-time interaction
modality support
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming multimodal understanding
proactive interaction
modality alignment
decoupled response triggering
real-time omni-multimodal assistant
πŸ‘₯ Authors

Xueyun Tian
Institute of Computing Technology
Multimodal Generation Β· MLLM

Wei Li
Institute of Computing Technology, Chinese Academy of Sciences

Bingbing Xu
Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences
Graph Neural Networks Β· Network Embedding

Heng Dong
Tsinghua University, Beijing, China

Yuanzhuo Wang
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China

Huawei Shen
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China