HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for temporal sentence grounding in videos (TSGV) are largely confined to closed-vocabulary settings and struggle to generalize to real-world queries containing novel words or diverse linguistic expressions. To address this limitation, the study introduces the Open-Vocabulary TSGV (OV-TSGV) task, constructs the first dedicated benchmarks for it, Charades-OV and ActivityNet-OV, and proposes the HERO framework. HERO combines hierarchical language embeddings with a parallel cross-modal refinement mechanism consisting of semantic-guided visual filtering and contrastive masked text refinement. Experiments show that HERO outperforms state-of-the-art models on both standard and open-vocabulary benchmarks, with the largest gains under the OV-TSGV setting, supporting its generalization capability.

📝 Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks, Charades-OV and ActivityNet-OV, that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open-vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.
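The abstract does not say how a localized segment is scored. As an illustration of the task's output format only, the sketch below assumes a prediction is a (start, end) pair in seconds and compares it to a ground-truth segment with temporal intersection-over-union, a measure commonly used in this area. The function names, the 0.5 threshold, and the example timestamps are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        raise ValueError("segment end must not precede its start")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Segment], gts: Sequence[Segment],
                threshold: float = 0.5) -> float:
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("need one prediction per non-empty ground truth list")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Hypothetical timestamps for two queries.
    preds = [(2.0, 8.0), (0.0, 1.0)]
    gts = [(4.0, 10.0), (5.0, 6.0)]
    print(temporal_iou(preds[0], gts[0]))
    print(recall_at_1(preds, gts))
```

Under an open-vocabulary setting the same scoring would apply unchanged; only the queries differ, since they may contain words or phrasings absent from training.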

Problem

Research questions and friction points this paper is trying to address.

Temporal Sentence Grounding

Open-Vocabulary

Video-Language Alignment

Generalization

Natural Language Query

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary TSGV

Hierarchical Embedding

Cross-Modal Refinement