Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the efficiency-accuracy trade-off in existing proactive video understanding models, which must decide frame by frame whether to respond and struggle to do so both promptly and precisely. To overcome this limitation, the authors propose Em-Garde, a framework that decouples semantic understanding from streaming perception through a "propose-match" mechanism. At query time, an Instruction-Guided Proposal Parser turns the user query into structured visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench show consistent improvements over prior models in both proactive response accuracy and efficiency, supporting the framework as a solution for proactive video understanding under strict computational constraints.

📝 Abstract
Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which leads to an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
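
The propose-match idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that matching is embedding-based, so the cosine similarity, the fixed threshold, the "any proposal matches" rule, and the names `cosine` and `first_trigger` are all assumptions made for illustration. The embeddings are hand-written toy vectors standing in for the outputs of a proposal parser and a frame encoder, neither of which is modeled here.

```python
import math
from typing import Iterable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_trigger(
    proposal_embeddings: Sequence[Vector],
    frame_embeddings: Iterable[Vector],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Optional[int]:
    """Return the index of the first frame that matches any proposal.

    Proposals are computed once (query time); each incoming frame only
    needs a few similarity comparisons (streaming time).
    """
    for t, frame in enumerate(frame_embeddings):
        if any(cosine(frame, p) >= threshold for p in proposal_embeddings):
            return t
    return None


# Toy data: two proposal embeddings and a three-frame stream (hypothetical).
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.7], [0.9, 0.1, 0.0]]

trigger_index = first_trigger(proposals, frames)
print(trigger_index)
```

In this toy stream only the last frame is close enough to a proposal to cross the assumed threshold, so the sketch triggers there. The point of the split is that the expensive step (turning a query into proposals) runs once, while the per-frame work reduces to cheap vector comparisons.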
Problem

Research questions and friction points this paper is trying to address.

Streaming Video Understanding
Proactive Response
Efficiency-Accuracy Trade-off
VideoLLMs
Real-time Video Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

propose-match framework
streaming video understanding
proactive VideoLLM
instruction-guided proposal
efficient embedding matching