CVA: Context-aware Video-text Alignment for Video Temporal Grounding

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses video-text misalignment in temporal video grounding, where irrelevant background content often introduces false negatives, particularly near temporal boundaries. To mitigate this, the authors propose a unified framework comprising Query-aware Context Diversification (QCD) for data augmentation, a Context-Invariant Boundary Discrimination (CBD) loss, and a Context-enhanced Transformer Encoder (CTE). By jointly optimizing data augmentation and model architecture, the approach improves query sensitivity and contextual robustness. The method combines similarity-based segment replacement, contrastive learning, and hierarchical attention, achieving state-of-the-art results on benchmarks such as QVHighlights and Charades-STA, including an improvement of roughly 5 percentage points in Recall@1 over prior methods.
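The similarity-based segment replacement in QCD can be sketched as follows. This is an illustrative reconstruction from the summary, not the authors' code: the function names, the cosine-similarity criterion, and the threshold value are all assumptions. The idea is that only clips whose similarity to the query falls below a threshold are eligible to replace background segments, so no query-relevant content (a would-be false negative) is mixed in.

```python
# Hypothetical sketch of Query-aware Context Diversification (QCD).
# All names and the threshold are illustrative assumptions.
import numpy as np

def qcd_replacement_pool(clip_feats, query_feat, sim_threshold=0.2):
    """Return indices of candidate clips that are safe to mix in.

    clip_feats: (N, D) unit-normalized candidate clip embeddings
    query_feat: (D,) unit-normalized query embedding
    """
    sims = clip_feats @ query_feat                # cosine similarity per clip
    return np.flatnonzero(sims < sim_threshold)   # keep only unrelated clips

def diversify(video_clips, bg_mask, pool_feats, pool_clips, query_feat, rng):
    """Swap background clips (bg_mask=True) with query-unrelated pool clips."""
    safe = qcd_replacement_pool(pool_feats, query_feat)
    out = list(video_clips)
    for i in np.flatnonzero(bg_mask):
        if safe.size:
            out[i] = pool_clips[rng.choice(safe)]
    return out
```

Filtering the pool by query similarity is what distinguishes this from query-agnostic mixing: a random replacement clip might accidentally match the query and corrupt the supervision signal.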

📝 Abstract
We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negatives" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
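The CBD loss described in the abstract is a contrastive objective over boundary representations. A minimal sketch, assuming an InfoNCE-style formulation (the exact loss form, tensor shapes, and temperature are assumptions, not taken from the paper): the boundary feature of the original video is pulled toward the same boundary under a shifted context (the positive) and pushed away from hard negatives.

```python
# Illustrative InfoNCE-style sketch of a Context-invariant Boundary
# Discrimination (CBD) loss; shapes and temperature are assumptions.
import numpy as np

def cbd_loss(bound_orig, bound_aug, negatives, tau=0.07):
    """Contrastive loss over temporal-boundary features.

    bound_orig: (B, D) boundary features from the original video
    bound_aug:  (B, D) same boundaries under a shifted context (positives)
    negatives:  (B, K, D) hard negatives, e.g. nearby off-boundary clips
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(bound_orig), unit(bound_aug), unit(negatives)
    pos = np.sum(a * p, axis=-1, keepdims=True) / tau   # (B, 1) positive logit
    neg = np.einsum('bd,bkd->bk', a, n) / tau           # (B, K) negative logits
    logits = np.concatenate([pos, neg], axis=1)         # positive at index 0
    # Stable log-sum-exp, then cross-entropy with the positive as the label.
    m = logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(logits - m).sum(axis=1)) + m[:, 0]
    return float(np.mean(log_z - logits[:, 0]))
```

Minimizing this loss makes a boundary's representation invariant to the surrounding context, which is precisely the robustness the abstract attributes to CBD.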
Problem

Research questions and friction points this paper is trying to address.

video temporal grounding
video-text alignment
context robustness
temporal boundaries
false negatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware Alignment
Video Temporal Grounding
Query-aware Context Diversification
Context-invariant Boundary Discrimination
Context-enhanced Transformer