HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

📅 2026-03-06
🤖 AI Summary
Existing video temporal sentence grounding methods are largely confined to closed-vocabulary settings and struggle to generalize to real-world queries that contain novel words or diverse linguistic expressions. To address this limitation, the study introduces the open-vocabulary temporal sentence grounding in videos (OV-TSGV) task, establishes two new benchmarks, Charades-OV and ActivityNet-OV, and proposes the HERO framework. HERO combines hierarchical language embeddings, semantic-guided visual filtering, and a parallel cross-modal refinement mechanism to improve generalization. Experiments show that the method outperforms state-of-the-art models on both standard and open-vocabulary benchmarks, with particularly large gains under the OV-TSGV setting.

📝 Abstract
Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks, Charades-OV and ActivityNet-OV, which simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond concepts seen during training. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open-vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.
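The abstract does not say how localization quality is scored. As a rough illustration only, a predicted segment in temporal grounding is commonly compared with the annotated segment through temporal intersection-over-union (tIoU), with recall reported at a fixed tIoU threshold. The sketch below assumes that convention; the function names `temporal_iou` and `recall_at_iou` and the 0.5 threshold are hypothetical choices, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose top prediction reaches the tIoU threshold.

    The threshold value is an assumption made for illustration.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 4.0)]
    ground_truth = [(0.0, 10.0), (6.0, 10.0)]
    print(recall_at_iou(predictions, ground_truth))
```

Under this convention, an open-vocabulary evaluation would apply the same scoring to queries that contain words or phrasings absent from training, so any drop in recall reflects the vocabulary shift and not a change of metric.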
Problem

Research questions and friction points this paper is trying to address.

Temporal Sentence Grounding
Open-Vocabulary
Video-Language Alignment
Generalization
Natural Language Query
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary TSGV
Hierarchical Embedding
Cross-Modal Refinement
Semantic-Guided Filtering
Contrastive Masked Text Refinement
Tingting Han
Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University
Xinsong Tao
Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University
Yufei Yin
Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University
Min Tan
Professor of School of Computer Science and Technology, Hangzhou Dianzi University
Machine Learning, Image Processing, Multimedia, Computer Vision
Sicheng Zhao
Tsinghua University
Affective Computing, Multimedia, Domain Adaptation, Computer Vision
Zhou Yu
Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University