MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cross-modal semantic gap in video temporal grounding, which often causes background features to be incorrectly aligned with the query and leaves direct query-to-moment matching insufficiently discriminative and temporally inconsistent. To mitigate this, the authors propose the MASRA framework, which uses a multimodal large language model (MLLM) during training to generate event-level descriptions with temporal spans and clip-level captions, forming dual-level textual priors. Two alignments build on these priors: Event Semantic Temporal Alignment (ESTA) strengthens the correspondence between semantics and temporal events, and Local Relational Consistency Alignment (LRCA) enforces local structural consistency. A Decoupled Alignment Interaction (DAI) mechanism with a context-aware codebook absorbs query-irrelevant semantics. The MLLM is not required at inference, so it adds no inference-time cost. The authors report that the method outperforms existing approaches on multiple benchmarks, and ablation studies support the effectiveness of each component.

📝 Abstract

Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a textual relation matrix derived from clip-level captions and aligns it with the temporal feature similarity matrix in the model, enhancing temporal consistency while capturing local structural information. MASRA includes two simple supporting modules, semantic-guided enhancement and second-order relational attention, to better utilize the learned semantic context and relational structure. Moreover, we introduce Decoupled Alignment Interaction (DAI) with a context-aware codebook to adaptively absorb query-irrelevant semantics and alleviate the cross-modal gap. The MLLM is only invoked during training and is not used at inference. Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness.
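
The LRCA idea described in the abstract (aligning a textual relation matrix built from clip-level captions with the model's temporal feature similarity matrix) can be illustrated with a minimal sketch. The abstract does not specify how either matrix is computed or which loss is used, so everything below is an assumption for illustration: cosine similarity for both matrices, a mean squared error as the alignment loss, and the hypothetical names `cosine_matrix` and `relation_alignment_loss`. It is not the authors' implementation.

```python
import math


def cosine_matrix(vectors):
    """Pairwise cosine similarity between the given vectors (assumed relation measure)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            # Small epsilon guards against division by zero for all-zero vectors.
            sim[i][j] = dot / (norms[i] * norms[j] + 1e-8)
    return sim


def relation_alignment_loss(caption_embeddings, clip_features):
    """Mean squared difference between the textual and visual relation matrices.

    caption_embeddings: one embedding per clip-level caption (hypothetical input).
    clip_features: one temporal feature per clip from the grounding model.
    The MSE form is an assumption; the abstract only says the matrices are aligned.
    """
    if len(caption_embeddings) != len(clip_features):
        raise ValueError("need exactly one caption embedding per clip feature")
    text_rel = cosine_matrix(caption_embeddings)
    video_rel = cosine_matrix(clip_features)
    n = len(text_rel)
    total = sum(
        (text_rel[i][j] - video_rel[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)
```

Under these assumptions, the loss is near zero when clips relate to each other the same way their captions do, and it grows when the visual features collapse distinctions that the captions preserve. Because the caption embeddings are only needed to compute this training loss, a term of this kind is compatible with the abstract's statement that the MLLM is not used at inference.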

Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding

cross-modal semantic gap

temporal semantics

semantic alignment

discriminability

Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-assisted alignment

Semantic-Relational Consistency

Video Temporal Grounding