🤖 AI Summary
This work addresses the challenge of temporal event localization in videos, where existing video large language models struggle to capture the semantic continuity and completeness of events, often resulting in ambiguous localization. To overcome this limitation, we propose a novel video large language model tailored for temporal video grounding. Our approach introduces a unified <event> token to aggregate information across all frames of an event, enabling holistic perception. We further enhance temporal modeling by smoothing the similarity curve over time using Savitzky-Golay filtering and design a multi-granularity frame feature aggregation strategy to better capture temporal dynamics. Extensive experiments demonstrate that our method significantly outperforms current state-of-the-art video large language models across multiple benchmark datasets, achieving more precise and robust temporal event localization.
📝 Abstract
Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
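To make the smoothing step concrete, here is a minimal, generic sketch of Savitzky-Golay filtering applied to a token-to-frame similarity curve. This is not the paper's implementation (which is not given here); it uses the standard window-5, quadratic-fit convolution coefficients (-3, 12, 17, 12, -3)/35 with reflective padding at the boundaries. In practice one would typically call `scipy.signal.savgol_filter` instead.

```python
def savgol5(similarities):
    """Savitzky-Golay smoothing: window length 5, polynomial order 2.

    Fits a local quadratic to each 5-frame window via fixed convolution
    coefficients, attenuating per-frame noise while preserving the broad
    shape (and any locally quadratic trend) of the similarity curve.
    """
    coeffs = [-3.0, 12.0, 17.0, 12.0, -3.0]  # standard S-G weights, sum = 35
    n = len(similarities)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(coeffs):
            j = i + k - 2              # window centered at frame i
            if j < 0:                  # reflect at the left boundary
                j = -j
            if j >= n:                 # reflect at the right boundary
                j = 2 * (n - 1) - j
            acc += c * similarities[j]
        smoothed.append(acc / 35.0)
    return smoothed

# Example: a single-frame spike (likely noise) is damped from 1.0 to 17/35,
# while a constant similarity level passes through unchanged.
spiky = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(savgol5(spiky)[3])  # 17/35 ≈ 0.4857
```

The smoothed curve can then be searched for the sustained high-similarity span that marks the event's start and end frames, rather than reacting to isolated noisy peaks.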