AI Summary
This work addresses the limitations of existing large multimodal models in streaming audio-visual understanding, which often suffer from incomplete modality support or a lack of proactive monitoring capabilities. To overcome these challenges, the authors propose ROMA (Real-time Omni-Modal Assistant), a novel framework that decouples response triggering from content generation via a lightweight "speaking head" mechanism, enabling unified active and passive interaction. ROMA further introduces a multimodal unit aligned with dense audio and discrete video frames, trained through a two-stage curriculum strategy and optimized on a streaming dataset. Evaluated across 12 benchmark tasks, ROMA achieves state-of-the-art performance on active tasks while maintaining strong results on passive ones, demonstrating its effectiveness as a unified and efficient solution for real-time multimodal understanding.
Abstract
Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming-format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
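The abstract's core mechanism, a lightweight speak head that decides *whether* to respond at each streaming step, separately from *what* the response says, can be illustrated with a minimal sketch. This is not the paper's implementation: the `speak_head` classifier, `streaming_step` loop, and the additive state update are simplified stand-ins assumed here for clarity.

```python
import math

def speak_head(hidden_state, weights, bias, threshold=0.5):
    """Hypothetical lightweight speak head: a binary classifier over the
    model's current hidden state that decides WHETHER to respond now,
    independently of WHAT the eventual response will be."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob >= threshold

def streaming_step(unit, state, weights, bias):
    """Process one synchronized multimodal unit (e.g. an audio chunk aligned
    with a video frame). Response generation (stubbed here) is invoked only
    when the speak head fires; otherwise the model stays silent and keeps
    monitoring the stream."""
    # Stand-in for the backbone update on the incoming multimodal unit.
    state = [s + u for s, u in zip(state, unit)]
    if speak_head(state, weights, bias):
        return state, "<response>"  # full generation would run here
    return state, None  # remain silent, continue monitoring
```

Because the trigger is a cheap per-step classification rather than a generation attempt, the model can monitor a continuous stream proactively and pay the cost of decoding only when the head decides a response is warranted.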