🤖 AI Summary
Existing Video Large Multimodal Models (VLMMs) lack adaptive reasoning under dynamic evidence: they fail to revise initial judgments in light of newly observed information.
Method: We introduce “defeasible video entailment,” a novel task requiring models to assess whether a hypothesis is strengthened or weakened by incremental visual-audio evidence and to generate logically consistent updated conclusions. We pioneer the integration of defeasible reasoning into video understanding, establishing a dedicated benchmark and evaluation protocol. Our approach, the Chain of Counterfactual Thought, combines ASR-enhanced transcription, rationale refinement, and multimodal semantic alignment to enable counterfactual-driven logical revision.
Contribution/Results: Experiments demonstrate significant improvements over baselines on both classification and generation tasks. Under our proposed metrics, generated conclusions better align with contextual evolution, substantially enhancing VLMMs’ skeptical, evidence-adaptive reasoning capabilities.
📝 Abstract
Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning: the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For the classification task, we propose the Chain of Counterfactual Thought framework, which uses counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.
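The two DVidE task variants can be pictured as operating over a shared instance schema. The sketch below is illustrative only; the field names and example are assumptions for exposition, not the paper's actual dataset format.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class DVidEInstance:
    """Hypothetical schema for one DVidE benchmark instance."""
    video_id: str      # reference to the video premise (the clip itself)
    hypothesis: str    # textual hypothesis initially entailed by the video
    update: str        # new piece of evidence presented to the model
    label: Literal["strengthener", "weakener"]  # effect of the update

# Illustrative (invented) example: new evidence that weakens the hypothesis.
example = DVidEInstance(
    video_id="clip_0001",
    hypothesis="The chef is preparing a vegetarian dish.",
    update="The chef adds diced chicken to the pan.",
    label="weakener",
)

# Classification version: given (video, hypothesis, update), predict `label`.
# Generation version inverts this: given (video, hypothesis, target label),
# produce an `update` that realizes the intended strengthening or weakening.
```

The schema makes the duality explicit: the same annotated triples support both supervised classification and conditioned generation, with the label serving as input in one setting and output in the other.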