🤖 AI Summary
To address the challenges of tightly coupled temporal localization and answer reasoning, along with high computational overhead in long-video question answering (QA), this paper proposes a two-stage interpretable QA framework. In the first stage, a low-frame-rate video skim enables coarse-grained temporal localization of question-relevant segments. In the second stage, span-aware visual token reallocation operates at a higher effective frame rate, jointly optimizing temporal span prediction and multiple-choice answer selection. We introduce a novel multiple-choice QA dataset with explicit temporal span annotations and design an interleaved group-relative objective that backpropagates answer-correctness signal to the temporal localization module, enabling end-to-end attributable training. The coupling loss integrates temporal Intersection-over-Union (tIoU) and answer accuracy under a fixed token budget. Compared to uniform sampling, our method reduces input frames by 50% while achieving up to 8.6% performance gains on Charades-STA and ActivityNet-Captions, significantly outperforming existing approaches.
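The coupled objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `w_loc` is a hypothetical mixing weight, and the group-relative normalization follows the usual GRPO-style recipe of standardizing rewards within a group of sampled rollouts.

```python
import statistics

def t_iou(pred, gt):
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def coupled_reward(pred_span, gt_span, pred_ans, gt_ans, w_loc=0.5):
    """Couple localization quality (tIoU) with answer correctness.
    w_loc is a hypothetical weight; the paper's exact form is not given."""
    return w_loc * t_iou(pred_span, gt_span) + (1 - w_loc) * float(pred_ans == gt_ans)

def group_relative_advantages(rewards):
    """Standardize each rollout's reward against its group's mean/std,
    so credit flows to rollouts that beat their siblings."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]
```

Because the reward sums a span term and an answer term, a rollout that localizes well but answers wrong still earns partial credit, which is what lets gradients reach the localization stage.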
📝 Abstract
We present *Video-in-the-Loop* (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first *localizing* question-relevant interval(s) with a low-fps skim and then *answering* via span-aware reallocation of visual tokens at a higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce dataname{}, which converts description-based event graphs into *span-grounded* multiple-choice QA by pairing each question with *ground-truth* time span(s) and the associated reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains gains of up to 8.6% with 50% fewer input frames on long-video QA and temporal grounding benchmarks (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
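Span-aware reallocation under a fixed budget can be pictured as follows. This is a sketch under assumptions: `in_span_ratio` is a hypothetical knob (the paper does not specify one), and it simply returns frame timestamps, with most of the budget sampled densely inside the predicted spans and the remainder spread uniformly for global context.

```python
def reallocate_frames(duration, spans, budget, in_span_ratio=0.8):
    """Spend a fixed frame budget: dense sampling inside predicted
    (start, end) spans, sparse uniform sampling over the whole video.
    Rounding may shift counts by one when spans overlap."""
    in_budget = int(budget * in_span_ratio)
    out_budget = budget - in_budget
    span_len = sum(e - s for s, e in spans)
    times = []
    for s, e in spans:
        # each span gets frames in proportion to its length
        n = max(1, round(in_budget * (e - s) / span_len))
        times += [s + (e - s) * (i + 0.5) / n for i in range(n)]
    # coarse uniform coverage keeps global context outside the spans
    times += [duration * (i + 0.5) / out_budget for i in range(out_budget)]
    return sorted(times)
```

Under the same total budget, this trades low-fps coverage of irrelevant footage for a higher effective frame rate inside the localized interval, which is the mechanism the ablations compare against uniform sampling.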