Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the tight coupling between temporal localization and answer reasoning, together with the high computational overhead of long-video question answering (QA), this paper proposes a two-stage, interpretable QA framework. In the first stage, a low-frame-rate skim of the video enables coarse-grained temporal localization of question-relevant segments. In the second stage, span-aware visual token reallocation operates at a higher effective frame rate, jointly optimizing temporal span prediction and multiple-choice answer selection. We introduce a new multiple-choice QA dataset with explicit temporal span annotations and design an interleaved group-relative objective that backpropagates answer-correctness gradients to the temporal localization module, enabling end-to-end attributable training. The coupling loss integrates temporal Intersection-over-Union (tIoU) and answer accuracy under a fixed token budget. Compared to uniform sampling, our method reduces input frames by 50% while achieving up to 8.6% performance gains on Charades-STA and ActivityNet-Captions, significantly outperforming existing approaches.
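The coupled loss described above can be illustrated with a small sketch. Assuming a group-relative (GRPO-style) setup, which the summary suggests but does not spell out, each sampled rollout carries a predicted span and a chosen option; its reward mixes tIoU with answer correctness and is then centered within the sampling group. The function names `temporal_iou` and `coupled_rewards` and the mixing weight `alpha` are illustrative choices, not the paper's API:

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def coupled_rewards(samples, gt_span, gt_answer, alpha=0.5):
    """Per-rollout reward = alpha * tIoU + (1 - alpha) * answer correctness,
    centered within the group so credit for a correct answer flows back to
    the span that enabled it, relative to the other rollouts."""
    raw = [alpha * temporal_iou(s["span"], gt_span)
           + (1 - alpha) * float(s["answer"] == gt_answer)
           for s in samples]
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]
```

A rollout that both overlaps the ground-truth span and answers correctly receives a positive group-relative advantage; one that answers correctly from the wrong span is rewarded less, which is the attribution effect the summary describes.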

📝 Abstract
We present Video-in-the-Loop (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first localizing question-relevant interval(s) with a low-fps skim and then answering via span-aware reallocation of visual tokens at a higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce dataname{}, which converts description-based event graphs into span-grounded multiple-choice QA by pairing each question with ground-truth time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% gains with 50% fewer input frames on long-video QA and temporal grounding benchmarks (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
Problem

Research questions and friction points this paper is trying to address.

Localizing question-relevant intervals in long videos efficiently
Answering questions via span-aware visual token reallocation
Providing interpretable attribution with spans and final answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with localization and answering phases
Span-grounded QA with time spans and reasoning
Interleaved group-relative objective for end-to-end training
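As a concrete illustration of span-aware token reallocation under a fixed budget, here is a minimal sketch at the level of frame timestamps. The split between in-span and global sampling (`inside_frac`) and the uniform residual skim are assumptions for illustration, not details taken from the paper:

```python
def allocate_frames(duration, span, budget, inside_frac=0.8):
    """Hypothetical span-aware sampling: spend most of a fixed frame budget
    densely inside the predicted span, and the remainder as a sparse global
    skim over the whole video (all times in seconds)."""
    start, end = span
    n_in = max(1, round(budget * inside_frac))
    n_out = budget - n_in
    # Midpoint sampling inside the span (higher effective frame rate).
    inside = [start + (end - start) * (i + 0.5) / n_in for i in range(n_in)]
    # Sparse uniform skim over the full duration for global context.
    outside = ([duration * (i + 0.5) / n_out for i in range(n_out)]
               if n_out else [])
    return sorted(inside + outside)
```

Against uniform sampling with the same budget, this concentrates the effective frame rate where the localized span says the evidence is, which is the mechanism the ablations credit for the reported gains.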
👥 Researchers
Chendong Wang — University of Wisconsin–Madison
Donglin Bai — Microsoft Research Asia
Yifan Yang — Microsoft Research Asia
Xiao Jin — CUHK (CV, RecSys)
Anlan Zhang — Adobe Research; University of Southern California (Mobile Computing, Networked Systems, VR/AR/MR)
Rui Wang — Microsoft Research Asia
Shiqi Jiang — Microsoft Research Asia
Yuqing Yang — Microsoft Research Asia
Hao Wu — Microsoft Research Asia
Qi Dai — Microsoft Research Asia
Chong Luo — Microsoft Research (Multimedia Communications, Computer Vision)
Ting Cao — Microsoft Research Asia
Lili Qiu — Professor, Dept. of Computer Science, The University of Texas; NAI Fellow, ACM Fellow, IEEE Fellow (Wireless Networks, Wireless Sensing, Mobile Computing, Systems, 5G)
Suman Banerjee — Department of CSE, IIT Jammu (Algorithmic Data Management, Social Network Analysis, Graph Theory and Graph Algorithms, Parameterized Complexity)