🤖 AI Summary
Existing omni-modal video understanding approaches struggle to handle the continuously growing audio-visual context in streaming scenarios and lack a mechanism for responding autonomously. Progress is further hindered by the absence of online evaluation benchmarks that support continuous multi-turn interaction. To address these limitations, this work proposes StreamOV, a framework that introduces an evidence-guided memory compression mechanism to retain critical historical information under a fixed memory budget, together with a hidden-state-driven response triggering strategy that decides when to respond during online inference. The work also constructs SOVBench, the first evaluation benchmark tailored for streaming omni-modal understanding. Experiments show that StreamOV achieves state-of-the-art performance across multiple benchmarks and is applicable in both online and offline settings.
📝 Abstract
While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, and two gaps limit progress in streaming scenarios. First, these methods lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.
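The abstract does not say how evidence is scored or how the trigger is computed, so the following is only a toy sketch of the two ideas it names: a memory that stays within a fixed budget, and a respond-or-stay-silent decision read from a hidden state. Everything in it is a hypothetical stand-in rather than StreamOV's actual design: the `BoundedMemory` class, the `should_respond` function, the precomputed relevance `score`, the lowest-score eviction rule, and the logistic probe.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BoundedMemory:
    """Toy long/short-term memory with a fixed total budget (illustrative only)."""
    short_capacity: int  # most recent chunks, always kept
    long_capacity: int   # older chunks, kept only if they score highly
    short: deque = field(default_factory=deque)
    long: list = field(default_factory=list)

    def add(self, step, chunk, score):
        # `score` is an assumed, precomputed relevance value for the chunk.
        self.short.append((score, step, chunk))
        if len(self.short) > self.short_capacity:
            # The oldest short-term chunk moves to long-term memory.
            self.long.append(self.short.popleft())
            if len(self.long) > self.long_capacity:
                # Over budget: evict the lowest-scoring long-term entry.
                self.long.remove(min(self.long, key=lambda e: e[0]))

    def context(self):
        # Return retained chunks in temporal order: long-term, then short-term.
        ordered_long = sorted(self.long, key=lambda e: e[1])
        return [c for _, _, c in ordered_long] + [c for _, _, c in self.short]


def should_respond(hidden, weights, bias=0.0, threshold=0.5):
    """Assumed linear probe on a hidden-state vector, squashed by a sigmoid."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit)) >= threshold


memory = BoundedMemory(short_capacity=2, long_capacity=2)
for step, score in enumerate([0.9, 0.1, 0.8, 0.2, 0.5, 0.3]):
    memory.add(step, f"c{step}", score)
kept = memory.context()
```

In this sketch the memory never holds more than `short_capacity + long_capacity` chunks regardless of stream length, and the response decision is a single thresholded read-out, so no silence token is generated and no separate routing model is called.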