🤖 AI Summary
This work addresses the efficiency-accuracy trade-off in existing proactive video understanding models, which must decide frame by frame whether to respond and struggle to do so both quickly and precisely. The authors propose Em-Garde, a framework that decouples semantic understanding from streaming perception through a "proposal-matching" mechanism. At query time, an Instruction-Guided Proposal Parser converts the user query into structured visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Because the costly semantic parsing happens once per query rather than once per frame, the per-frame work reduces to embedding matching. Experiments on StreamingBench and OVO-Bench show consistent improvements over prior models in proactive response accuracy and efficiency under strict computational constraints.
📝 Abstract
Recent advances in streaming video understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
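The streaming-side idea of embedding-based matching can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `cosine` and `match_stream`, the use of cosine similarity, the fixed `threshold`, and the toy three-dimensional vectors standing in for proposal and frame embeddings. The abstract does not specify how Em-Garde computes or thresholds its matches.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_stream(
    proposal_embs: List[Sequence[float]],
    frame_embs: List[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not from the paper
) -> Optional[int]:
    """Return the index of the first frame whose best similarity to any
    proposal reaches the threshold, or None if no frame triggers."""
    if not proposal_embs:
        return None
    for t, frame in enumerate(frame_embs):
        best = max(cosine(frame, p) for p in proposal_embs)
        if best >= threshold:
            return t
    return None


# Toy data: proposals are computed once at query time,
# frames arrive one by one during streaming.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.7], [0.9, 0.1, 0.0]]
trigger_index = match_stream(proposals, frames)
```

In this toy run the third frame (index 2) is the first to align closely with a proposal, so it is the one that would trigger a response. The per-frame cost is a handful of dot products, which is the kind of lightweight operation the decoupled design relies on.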