Can Video Large Multimodal Models Think Like Doubters, or Double Down: A Study on Defeasible Video Entailment

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Video Large Multimodal Models (VLMMs) lack adaptive reasoning under dynamic evidence: they fail to revise initial judgments in light of newly observed information. Method: We introduce "defeasible video entailment," a novel task requiring models to assess whether a hypothesis is strengthened or weakened by incremental visual-audio evidence and to generate logically consistent updated conclusions. We pioneer the integration of defeasible reasoning into video understanding, establishing a dedicated benchmark and evaluation protocol. Our classification approach, Chain of Counterfactual Thought, combines ASR-enhanced transcription, rationale refinement, and multimodal semantic alignment to enable counterfactual-driven logical revision. Contribution/Results: Experiments demonstrate significant improvements over baselines on both the classification and generation tasks. Under our proposed metrics, generated conclusions better align with contextual evolution, substantially enhancing VLMMs' skeptical, evidence-adaptive reasoning capabilities.

📝 Abstract
Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning: the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For the classification task, we propose the Chain of Counterfactual Thought framework, which uses counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goal. Additionally, we introduce a novel benchmark dataset with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed methods in enhancing the dynamic reasoning capabilities of VLMMs.
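To make the task setup concrete, here is a minimal sketch of what one DVidE classification instance might look like. The field names, example values, and helper function are illustrative assumptions, not taken from the paper's released benchmark.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical DVidE instance layout (illustrative only): a video premise,
# a textual hypothesis, an update, and a binary strengthener/weakener label.
@dataclass
class DVidEInstance:
    video_premise: str   # path or ID of the premise video
    hypothesis: str      # textual hypothesis about the video
    update: str          # newly observed evidence, expressed as text
    label: Literal["strengthener", "weakener"]  # classification target

def is_valid_label(instance: DVidEInstance) -> bool:
    """The classification version is binary: every update either
    strengthens or weakens the hypothesis."""
    return instance.label in ("strengthener", "weakener")

example = DVidEInstance(
    video_premise="video_0001.mp4",
    hypothesis="The chef is preparing a vegetarian dish.",
    update="The chef adds diced chicken to the pan.",
    label="weakener",
)
print(is_valid_label(example))  # True
```

In the generation version, the model is instead given the target label and must produce the `update` text itself.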
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMMs' ability to revise interpretations with new video evidence
Introducing Defeasible Video Entailment task for dynamic hypothesis updating
Developing frameworks to reduce bias and improve generative updates in VLMMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Counterfactual Thought for bias reduction
ASR-enhanced video with LLM for updates
Novel benchmark dataset with LLM metrics
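The generation framework above pairs ASR output with an LLM. A minimal sketch of how such a prompt could be assembled is shown below; the prompt wording, function name, and example strings are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical prompt construction for the DVidE generation task:
# fuse the ASR transcript of the video with the hypothesis and the
# intended strengthener/weakener goal, then hand the prompt to an LLM.
def build_update_prompt(asr_transcript: str, hypothesis: str, goal: str) -> str:
    assert goal in ("strengthener", "weakener")
    return (
        "Video transcript (ASR):\n"
        f"{asr_transcript}\n\n"
        f"Hypothesis: {hypothesis}\n\n"
        f"Write one sentence of new evidence that acts as a {goal} "
        "for this hypothesis, staying consistent with the transcript."
    )

prompt = build_update_prompt(
    asr_transcript="Today we're making a classic margherita pizza.",
    hypothesis="The speaker is cooking an Italian dish.",
    goal="strengthener",
)
print("strengthener" in prompt)  # True
```

The returned string would then be sent to an LLM; the paper's LLM-based metric scores whether the generated update actually shifts the entailment in the intended direction.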
👥 Authors
Yue Zhang, Department of Computer Science, The University of Texas at Dallas, Richardson, TX, USA
Jilei Sun, Department of Computer Science, The University of Texas at Dallas, Richardson, TX, USA
Yunhui Guo, UT Dallas (Computer Vision, Machine Learning, Edge Computing)
Vibhav Gogate, The University of Texas at Dallas (Artificial Intelligence, Graphical Models, Machine Learning, Probabilistic Inference, Statistical Relational Learning)