TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the joint task of temporal grounding and open-ended answer generation in weakly supervised video question answering, i.e., without any temporal annotations. Methodologically, it introduces an end-to-end joint decoding framework featuring: (1) instruction tuning of a large vision-language model to generate the answer together with its start/end timestamps; (2) pseudo-label generation for temporal grounding; and (3) a verification mechanism that enforces cross-response consistency to validate these pseudo labels, eliminating reliance on manual temporal annotations. Evaluated on NExT-GQA, MSVD-QA, and ActivityNet-QA, the method achieves state-of-the-art performance across all benchmarks, improving both QA accuracy and temporal grounding precision. The authors report it as the first approach to achieve high-quality joint answer generation and grounding under weak supervision, i.e., without any time-aligned annotations.
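The joint decoding idea means the model emits one sequence containing both the answer text and the temporal segment, which must then be split apart. A minimal sketch of such a parser is below; the `answer <start, end>` suffix template is an illustrative assumption, not necessarily TOGA's exact output format.

```python
import re

def parse_grounded_answer(response: str):
    """Split a jointly generated response into (answer, (start, end)).

    Assumes a hypothetical template like "pouring coffee <12.5, 18.0>";
    returns (answer, None) if no temporal segment was emitted.
    """
    m = re.match(r"^(.*?)\s*<\s*([\d.]+)\s*,\s*([\d.]+)\s*>\s*$", response)
    if m is None:
        return response.strip(), None
    answer = m.group(1).strip()
    segment = (float(m.group(2)), float(m.group(3)))
    return answer, segment
```

Keeping the answer and the segment in a single decoded sequence is what lets one model be instruction-tuned for both outputs at once.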

📝 Abstract
We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Video QA with temporal grounding, without any temporal annotations
Jointly generating open-ended answers and their temporal grounding
Making a weakly supervised setup competitive on QA and grounding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly generate answer and temporal grounding
Weak supervision with pseudo temporal labels
Consistency constraint ensures label validity
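The consistency constraint above can be sketched as a simple filter: a pseudo temporal label is kept only if the answer generated from the full video agrees with the answer generated when the model is conditioned on the pseudo segment alone. The token-overlap F1 used as the agreement measure here is an illustrative placeholder, not the paper's actual criterion.

```python
def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer strings (illustrative similarity)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    precision, recall = common / len(tb), common / len(ta)
    return 2 * precision * recall / (precision + recall)

def keep_pseudo_label(full_video_answer: str,
                      segment_answer: str,
                      threshold: float = 0.5) -> bool:
    """Accept a pseudo (start, end) label only when the answer conditioned
    on that segment is consistent with the full-video answer."""
    return token_f1(full_video_answer, segment_answer) >= threshold
```

Filtering pseudo labels this way is what removes the need for manual timestamps: only segments that independently support the same answer survive as training signal.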