OVG-HQ: Online Video Grounding with Hybrid-modal Queries

📅 2025-08-16
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper introduces Online Video Grounding with Hybrid-modal Queries (OVG-HQ), a new task for localizing segments in streaming video in response to queries given as text, images, video clips, or their combinations. To tackle the limited context of online inference and the modality imbalance that arises during training, the authors propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block that carries temporal context across the stream and a cross-modal knowledge distillation strategy that guides the learning of non-dominant query modalities. They further construct QVHighlights-Unify, a benchmark extended with multimodal queries for online video grounding, and introduce online metrics, including online mean Average Precision (omAP), that jointly evaluate localization accuracy and prediction timeliness. Extensive experiments show that OVG-HQ-Unify outperforms state-of-the-art methods, and the code and datasets are publicly released.
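To make the memory idea concrete, here is a minimal PyTorch-style sketch of one plausible parametric memory for streaming clips: a fixed-size learnable state updated with a gated rule as each clip feature arrives, so earlier context can inform the current grounding decision without buffering raw frames. The class name, the GRU-cell update, and the tensor shapes are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParametricMemoryBlock(nn.Module):
    """Hypothetical sketch: a learnable memory state folded over
    streaming clip features with a gated (GRU-style) update."""

    def __init__(self, dim: int):
        super().__init__()
        self.memory0 = nn.Parameter(torch.zeros(1, dim))  # learned initial state
        self.update = nn.GRUCell(dim, dim)                # gated memory update

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (T, B, dim), clip features arriving one step at a time.
        T, B, _ = clip_feats.shape
        mem = self.memory0.expand(B, -1).contiguous()
        states = []
        for t in range(T):
            mem = self.update(clip_feats[t], mem)  # fold clip t into the memory
            states.append(mem)
        # (T, B, dim): the memory state after each step, one per decision point.
        return torch.stack(states, dim=0)
```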

📝 Abstract
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually given in text form. However, traditional VG struggles in scenarios such as streaming video or queries that use visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multimodal queries. Moreover, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n,IoU=m and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
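The cross-modal distillation strategy is described only at a high level above; a common way to realize such guidance is temperature-scaled knowledge distillation, where a dominant-modality branch (e.g., text queries) teaches a non-dominant one (e.g., image queries). The sketch below is that generic recipe under assumed names, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """KL distillation from a dominant-modality teacher branch to a
    non-dominant student branch over the same grounding targets."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)  # teacher not updated here
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Temperature-scaled KD (Hinton et al.); t^2 keeps gradient magnitudes comparable.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```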
Problem

Research questions and friction points this paper is trying to address.

Online video grounding with hybrid-modal queries
Addressing limited context and modality imbalance
Evaluating accuracy and efficiency in online settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric Memory Block for knowledge retention
Cross-modal distillation for modality balance
Online metrics (oR@n,IoU=m and omAP) for timeliness-aware evaluation, as sketched below
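As one way to see how timeliness can be folded into an offline-style metric, the sketch below adapts a recall-at-IoU check so that a prediction counts only if it is emitted no later than the ground-truth moment ends. The paper's actual oR@n,IoU=m and omAP definitions are more involved; this is a simplified one-prediction-per-video illustration with a hypothetical data layout.

```python
def online_recall(predictions, ground_truths, iou_thresh=0.5):
    """predictions: list of (start, end, emit_time), one per video.
    ground_truths: list of (start, end), aligned by index.
    A hit requires both sufficient IoU and a timely emission."""
    hits = 0
    for (ps, pe, emit_time), (gs, ge) in zip(predictions, ground_truths):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        iou = inter / union if union > 0 else 0.0
        # Late predictions score zero, unlike in the offline metric.
        if iou >= iou_thresh and emit_time <= ge:
            hits += 1
    return hits / max(len(ground_truths), 1)
```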
👥 Authors
Runhao Zeng
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Jiaqi Mao
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Minghao Lai
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Minh Hieu Phan
University of Adelaide
Yanjie Dong
Associate Professor, Shenzhen MSU-BIT University
Machine learning and optimization; wireless for AI
Wei Wang
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University
Qi Chen
University of Adelaide
Xiping Hu
Professor, Beijing Institute of Technology
Cyber-Physical Systems; Crowd Computing; Affective Computing