🤖 AI Summary
In commonsense video question answering, black-box models often learn spurious correlations due to dataset biases, leading to failures in complex reasoning, especially over causal, temporal, and counterfactual cues.
Method: We propose the first video-clip-anchored entailment tree reasoning framework, which explicitly grounds question answering in localizable video segments. Our approach employs multi-stage dynamic tree expansion, cross-modal (video–language) entailment verification, and an LLM-driven bias-mitigating question rewriting mechanism to ensure interpretable and fair inference.
Contribution/Results: Extensive evaluation on both original and debiased benchmarks demonstrates substantial improvements in generalization and robustness across diverse visual-language models and complex reasoning tasks. Our framework advances trustworthy video understanding by introducing a transparent, segment-grounded reasoning paradigm that enhances both explainability and fairness in multimodal inference.
📝 Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
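The four-step pipeline in the abstract can be sketched as a minimal toy program. Everything here is an illustrative assumption, not the authors' implementation: `EntailmentNode`, `build_tree`, `verify`, and `expand` are hypothetical names, the "video segments" are plain text descriptions standing in for localized clips, and the cross-modal entailment verifier is stubbed with a keyword-overlap check where a real system would query a VLM.

```python
from dataclasses import dataclass, field

@dataclass
class EntailmentNode:
    """One hypothesis in the entailment tree (illustrative structure)."""
    statement: str
    children: list["EntailmentNode"] = field(default_factory=list)
    verified: bool = False

def build_tree(question: str, answer: str) -> EntailmentNode:
    # Step 1: entailment tree construction. A real system would prompt an
    # LLM to decompose the QA pair; here we hard-code a toy decomposition.
    root = EntailmentNode(f"{answer} answers: {question}")
    root.children = [EntailmentNode("a person opens the door"),
                     EntailmentNode("the dog runs outside")]
    return root

def verify(statement: str, segments: list[str]) -> bool:
    # Step 2: video-language entailment verification, stubbed as keyword
    # overlap against text descriptions standing in for video clips.
    words = set(statement.lower().split())
    return any(len(words & set(seg.lower().split())) >= 3 for seg in segments)

def expand(node: EntailmentNode) -> list[EntailmentNode]:
    # Step 4: dynamic expansion of an unverified hypothesis into finer
    # sub-hypotheses (stubbed here by splitting the statement in half).
    words = node.statement.split()
    if len(words) < 4:
        return []
    mid = len(words) // 2
    return [EntailmentNode(" ".join(words[:mid])),
            EntailmentNode(" ".join(words[mid:]))]

def reason(node: EntailmentNode, segments: list[str], depth: int = 2) -> bool:
    # Step 3: tree reasoning. A node holds if it is directly grounded in
    # a segment, or if all of its child premises hold.
    if verify(node.statement, segments):
        node.verified = True
        return True
    if not node.children and depth > 0:
        node.children = expand(node)  # Step 4 fires when verification fails
    if node.children:
        node.verified = all(reason(c, segments, depth - 1)
                            for c in node.children)
    return node.verified

tree = build_tree("why did the dog leave?", "the door was opened")
segments = ["a person opens the front door", "the dog runs outside happily"]
print(reason(tree, segments))
```

In this toy run the root hypothesis is not directly groundable in any segment, so the answer is accepted only because both of its premises are verified against separate clips, which is the interpretable, segment-anchored behavior the method targets.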