VITED: Video Temporal Evidence Distillation

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing video question answering (VQA) models uniformly sample a fixed number of frames, which misses temporally sparse yet critical evidence and cannot localize multiple disjoint evidence segments across the full video context for multi-step reasoning.
Method: The authors propose a chain-based temporal evidence reasoning framework that automatically constructs video evidence chains and generates them end to end, moving beyond fixed-frame sampling. The approach combines optimization-driven evidence-interval discovery, evidence-chain supervised training, and joint modeling of temporal localization and multi-hop reasoning.
Contribution/Results: The framework enables fine-grained, cross-temporal visual evidence localization and compositional reasoning. It significantly outperforms state-of-the-art (SOTA) evidence-agnostic VQA methods on multiple long-video QA benchmarks, with the largest accuracy gains on complex, multi-step questions.

📝 Abstract
We investigate complex video question answering via chain-of-evidence reasoning -- identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence that maximize the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it to both localize evidence windows and perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks, where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
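To make the interval-search idea concrete, here is a minimal sketch of what searching for disjoint evidence intervals could look like, assuming per-frame relevance scores as a stand-in for the answer-likelihood objective described in the abstract. The function name, the greedy strategy, and all parameters are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch: greedily select disjoint temporal windows whose
# frames best support answering the question. `frame_scores` stands in
# for a learned relevance/answer-likelihood signal.
def select_evidence_intervals(frame_scores, window=8, max_intervals=3):
    """Pick up to `max_intervals` disjoint windows of length `window`
    with the highest total relevance score, returned in time order."""
    n = len(frame_scores)
    # Score every candidate window [i, i + window)
    window_scores = [
        (sum(frame_scores[i:i + window]), i) for i in range(n - window + 1)
    ]
    window_scores.sort(reverse=True)  # best candidates first

    chosen, used = [], set()
    for _score, start in window_scores:
        span = range(start, start + window)
        if used.isdisjoint(span):     # keep evidence intervals disjoint
            chosen.append((start, start + window))
            used.update(span)
        if len(chosen) == max_intervals:
            break
    return sorted(chosen)             # chronological evidence chain

# Example: two bursts of relevant frames separated by irrelevant content
scores = [0.1] * 10 + [0.9] * 8 + [0.1] * 20 + [0.8] * 8 + [0.1] * 10
print(select_evidence_intervals(scores, window=8, max_intervals=2))
# -> [(10, 18), (38, 46)]
```

In practice the paper's optimization would operate over model likelihoods rather than precomputed scores, but the output shape is the same: an ordered set of disjoint spans covering the evidence.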
Problem

Research questions and friction points this paper is trying to address.

Enhance video question answering via evidence reasoning chains
Localize critical evidence in long-form video content
Improve multi-step reasoning in complex video QA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-evidence reasoning for video QA
Automated optimal interval search for evidence
Direct generation of evidence chains by VITED
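Since VITED is trained to generate evidence chains directly, the supervision targets must serialize intervals and their supporting evidence into text. A minimal sketch of one possible serialization follows; the format, function name, and example captions are illustrative assumptions, not the paper's actual target format.

```python
# Hypothetical sketch: serialize (start, end) intervals plus
# per-interval evidence text into a single target string that a
# generative model could be trained to emit.
def format_evidence_chain(intervals, captions, answer):
    """Join timestamped evidence steps into one chain string ending
    with the final answer."""
    steps = [
        f"[{s:.1f}s-{e:.1f}s] {cap}"
        for (s, e), cap in zip(intervals, captions)
    ]
    return " -> ".join(steps) + f" => answer: {answer}"

chain = format_evidence_chain(
    [(12.0, 18.5), (47.0, 52.0)],
    ["person picks up the keys", "person unlocks the red door"],
    "the red door",
)
print(chain)
# [12.0s-18.5s] person picks up the keys -> [47.0s-52.0s] person unlocks the red door => answer: the red door
```

Training on such targets lets a single decoding pass produce both the temporal localization and the multi-step reasoning trace.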