Boosting Temporal Sentence Grounding via Causal Inference

📅 2025-07-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing temporal sentence grounding (TSG) methods suffer from spurious correlations between inherent textual biases—such as verb or phrase co-occurrence—and salient visual patterns in videos, leading to unreliable localization and poor out-of-distribution (OOD) generalization. To address this, we propose the first causal intervention and counterfactual reasoning framework for TSG. Our approach constructs a structural causal model (SCM) and applies do-calculus to perform causal interventions on textual variables. Furthermore, we design counterfactual scenarios that rely solely on video features, thereby decoupling visual representations from textual biases. Evaluated on multiple public benchmarks, our method achieves significant improvements over state-of-the-art approaches, particularly under complex scenes and OOD settings, demonstrating superior robustness and generalization capability.

Technology Category

Application Category

📝 Abstract
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
Problem

Research questions and friction points this paper is trying to address.

Eliminating spurious correlations in video-text queries
Addressing textual biases affecting visual moment prediction
Enhancing model robustness via causal inference techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes causal inference for TSG robustness
Proposes textual causal intervention with do-calculus
Debiases via visual counterfactual reasoning
🔎 Similar Papers
No similar papers found.
K
Kefan Tang
Xidian University, Xi’an, Shaanxi, China
Lihuo He
Lihuo He
Professor, Xidian University
Image/Video Quality AssessmentVisual Perception
J
Jisheng Dang
National University of Singapore, Singapore
X
Xinbo Gao
Xidian University, Xi’an, Shaanxi, China