Cross-modal Causal Relation Alignment for Video Question Grounding

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Video Question Grounding (VideoQG) methods are vulnerable to spurious cross-modal correlations, leading to erroneous visual scene interpretation and unfaithful reasoning. To address this, we propose the first cross-modal deconfounding framework specifically designed for VideoQG. Our approach integrates Gaussian-smoothed temporal localization, bidirectional contrastive alignment, and explicit front-door/back-door causal interventions to enforce causal consistency between question-answering intent and video segments. By decoupling multimodal features and leveraging cross-attention mechanisms under weak supervision, we overcome key bottlenecks in causal modeling. Evaluated on two major VideoQG benchmarks, our method achieves significant improvements in localization accuracy and QA robustness while effectively suppressing spurious correlations. The source code is publicly available.
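The summary mentions explicit front-door/back-door causal interventions for deconfounding. As a rough illustration of the back-door direction used for language, the adjustment marginalizes a confounder with its prior: P(y | do(x)) = Σ_z P(y | x, z) P(z). The toy probabilities and the shape conventions below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz: np.ndarray, p_z: np.ndarray) -> np.ndarray:
    """Back-door adjustment for a fixed question x.

    p_y_given_xz: (num_z, num_y) conditionals P(y | x, z) per confounder stratum z.
    p_z: (num_z,) prior over confounder strata (e.g. dataset-level language priors).
    Returns P(y | do(x)) by marginalizing the confounder with its prior.
    """
    return p_z @ p_y_given_xz

# Toy example: two confounder strata, three candidate answers (made-up numbers).
p_y_given_xz = np.array([[0.7, 0.2, 0.1],
                         [0.2, 0.5, 0.3]])
p_z = np.array([0.5, 0.5])
p_do = backdoor_adjust(p_y_given_xz, p_z)
```

Unlike the naive conditional P(y | x), which lets the confounder's correlation with x leak into the answer distribution, the adjusted estimate weights each stratum only by its prior.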

📝 Abstract
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at https://github.com/WissingChen/CRA-GQA.
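The Gaussian Smoothing Grounding (GSG) module described above can be sketched in a few lines: frame-level cross-modal attention scores are denoised with a Gaussian kernel, and a contiguous time interval is read off above a relative threshold. The kernel width, threshold, and toy attention scores are illustrative assumptions; the paper's filter is adaptive and learned.

```python
import numpy as np

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel centered on the middle tap."""
    x = np.arange(size) - size // 2
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def ground_interval(attn: np.ndarray, sigma: float = 1.5, thresh: float = 0.6):
    """attn: (num_frames,) raw question-to-frame attention scores.

    Smooths the scores, then returns the (start_frame, end_frame) span of
    frames whose smoothed score exceeds `thresh` times the smoothed maximum.
    """
    smoothed = np.convolve(attn, gaussian_kernel(7, sigma), mode="same")
    above = np.flatnonzero(smoothed >= thresh * smoothed.max())
    return int(above[0]), int(above[-1])

# Noisy attention with an isolated spike at frame 6 and a plateau around 3-7:
attn = np.array([0.05, 0.1, 0.0, 0.6, 0.9, 0.2, 0.95, 0.7, 0.1, 0.05])
start, end = ground_interval(attn)
```

Smoothing merges the jittery peaks into one contiguous segment instead of fragmenting the grounding around each raw spike, which is the de-noising effect the abstract attributes to the adaptive Gaussian filter.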
Problem

Research questions and friction points this paper is trying to address.

Eliminate spurious cross-modal correlations in VideoQG
Improve causal consistency between QA and video grounding
Enhance robustness and generalization in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Smoothing Grounding for time interval estimation
Cross-Modal Alignment using bidirectional contrastive learning
Explicit Causal Intervention for multimodal deconfounding
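The Cross-Modal Alignment innovation above can be sketched as a symmetric (bidirectional) contrastive objective between grounded video-segment features and QA features. The InfoNCE form and temperature below are common defaults assumed for illustration, not the paper's exact loss.

```python
import numpy as np

def info_nce(a: np.ndarray, b: np.ndarray, temp: float = 0.07) -> float:
    """One direction of contrastive alignment.

    a, b: (batch, dim) L2-normalized features where a[i] matches b[i].
    Returns the mean negative log-likelihood of the matched (diagonal) pairs.
    """
    logits = a @ b.T / temp
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

def bidirectional_loss(video: np.ndarray, qa: np.ndarray) -> float:
    # Average the segment -> QA and QA -> segment directions.
    return 0.5 * (info_nce(video, qa) + info_nce(qa, video))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
loss_aligned = bidirectional_loss(v, v)                      # matched pairs
loss_mismatched = bidirectional_loss(v, np.roll(v, 1, axis=0))  # shuffled pairs
```

Averaging both directions makes the alignment symmetric, so neither modality dominates: well-aligned segment/QA pairs yield a lower loss than mismatched ones.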