OVG-HQ: Online Video Grounding with Hybrid-modal Queries

📅 2025-08-16
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper introduces Online Video Grounding with Hybrid-modal Queries (OVG-HQ), a new task for localizing segments in streaming video in response to queries given as text, images, video clips, or their combinations. To tackle the limited context of online inference and the modality imbalance that arises during training, the authors propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block that carries temporal context across the stream and a cross-modal knowledge distillation strategy that guides the learning of non-dominant query modalities. They further construct QVHighlights-Unify, a benchmark extended with multimodal queries for online video grounding, and introduce online metrics, including online mean Average Precision (omAP), that jointly evaluate localization accuracy and prediction timeliness. Extensive experiments show that OVG-HQ-Unify outperforms state-of-the-art methods, and the code and datasets are publicly released.
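To make the memory idea concrete, here is a minimal PyTorch-style sketch of one plausible parametric memory for streaming clips: a fixed-size learnable state updated with a gated rule as each clip feature arrives, so earlier context can inform the current grounding decision without buffering raw frames. The class name, the GRU-cell update, and the tensor shapes are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParametricMemoryBlock(nn.Module):
    """Hypothetical sketch: a learnable memory state folded over
    streaming clip features with a gated (GRU-style) update."""

    def __init__(self, dim: int):
        super().__init__()
        self.memory0 = nn.Parameter(torch.zeros(1, dim))  # learned initial state
        self.update = nn.GRUCell(dim, dim)                # gated memory update

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (T, B, dim), clip features arriving one step at a time.
        T, B, _ = clip_feats.shape
        mem = self.memory0.expand(B, -1).contiguous()
        states = []
        for t in range(T):
            mem = self.update(clip_feats[t], mem)  # fold clip t into the memory
            states.append(mem)
        # (T, B, dim): the memory state after each step, one per decision point.
        return torch.stack(states, dim=0)
```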

📝 Abstract
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually given in text form. However, traditional VG struggles in scenarios such as streaming video or queries that use visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multimodal queries. Moreover, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n,IoU=m and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
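The cross-modal distillation strategy is described only at a high level above; a common way to realize such guidance is temperature-scaled knowledge distillation, where a dominant-modality branch (e.g., text queries) teaches a non-dominant one (e.g., image queries). The sketch below is that generic recipe under assumed names, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """KL distillation from a dominant-modality teacher branch to a
    non-dominant student branch over the same grounding targets."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)  # teacher not updated here
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Temperature-scaled KD (Hinton et al.); t^2 keeps gradient magnitudes comparable.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```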
Problem

Research questions and friction points this paper is trying to address.

Online video grounding with hybrid-modal queries
Addressing limited context and modality imbalance
Evaluating accuracy and efficiency in online settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric Memory Block for knowledge retention
Cross-modal distillation for modality balance
Online metrics (oR@n,IoU=m and omAP) for timeliness-aware evaluation, as sketched below
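As one way to see how timeliness can be folded into an offline-style metric, the sketch below adapts a recall-at-IoU check so that a prediction counts only if it is emitted no later than the ground-truth moment ends. The paper's actual oR@n,IoU=m and omAP definitions are more involved; this is a simplified one-prediction-per-video illustration with a hypothetical data layout.

```python
def online_recall(predictions, ground_truths, iou_thresh=0.5):
    """predictions: list of (start, end, emit_time), one per video.
    ground_truths: list of (start, end), aligned by index.
    A hit requires both sufficient IoU and a timely emission."""
    hits = 0
    for (ps, pe, emit_time), (gs, ge) in zip(predictions, ground_truths):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        iou = inter / union if union > 0 else 0.0
        # Late predictions score zero, unlike in the offline metric.
        if iou >= iou_thresh and emit_time <= ge:
            hits += 1
    return hits / max(len(ground_truths), 1)
```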
👥 Authors
Runhao Zeng
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Jiaqi Mao
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Minghao Lai
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Minh Hieu Phan
University of Adelaide
Yanjie Dong
Associate Professor, Shenzhen MSU-BIT University
Machine learning and optimization; wireless for AI
Wei Wang
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Qi Chen
University of Adelaide
Xiping Hu
Professor, Beijing Institute of Technology
Cyber-Physical Systems; Crowd Computing; Affective Computing