🤖 AI Summary
Video Question Answering (VideoQA) suffers from limited modeling of dynamic visual relationships due to the inadequate temporal representation capability of conventional vision-language models (VLMs) for long-duration videos. To address this, we propose a novel structured video representation paradigm: explicitly encoding videos as temporally ordered subject–predicate–object (SPO) triplet sets, enabling decomposable and interpretable relational reasoning. Methodologically, we introduce the first unordered set alignment framework for relation extraction, integrating spatiotemporal scene graph–inspired relational modeling, language-embedding-guided triplet representation, a Many-to-Many Noise Contrastive Estimation (MM-NCE) loss, and a Q-Former–based collaborative architecture to achieve fine-grained alignment between video queries and textual relation descriptions. Our approach significantly outperforms global representation methods (e.g., CLS token or patch-based aggregation) across five benchmarks, including NeXT-QA and Intent-QA, and establishes new state-of-the-art performance on temporal reasoning and complex relational understanding tasks.
📝 Abstract
Video Question Answering (VideoQA) requires capturing complex visual relation changes over time, which remains a challenge even for advanced Video Language Models (VLMs), among other reasons because the visual content must be compressed into a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets in the form of (*subject-predicate-object*) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation (MM-NCE) loss together with a Q-Former architecture to align an unordered set of video-derived queries with the corresponding text-based relation descriptions. At inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for VideoQA. We evaluate the proposed framework on five challenging benchmarks: NeXT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the query-based video representation outperforms global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released.
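The abstract describes aligning an *unordered* set of video-derived query embeddings with a set of text relation embeddings via MM-NCE. The paper's exact loss is not reproduced here, but a common way to realize such a set-level contrastive objective is to first compute an optimal one-to-one matching between the two sets (e.g., Hungarian assignment) and then apply an InfoNCE-style loss in which each query's matched relation is the positive and all other relations are negatives. The sketch below is a minimal illustration under those assumptions; the function name `mm_nce_loss` and the matching strategy are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mm_nce_loss(queries: np.ndarray, relations: np.ndarray,
                temperature: float = 0.07) -> float:
    """Illustrative many-to-many set-alignment NCE loss (not the paper's exact code).

    queries:   (N, D) L2-normalized video-derived query embeddings
    relations: (M, D) L2-normalized text relation embeddings, M >= N
    """
    # Pairwise cosine similarities, scaled by temperature.
    sim = queries @ relations.T / temperature            # (N, M)
    # Hungarian matching: one-to-one assignment maximizing total similarity,
    # which makes the loss invariant to the ordering of either set.
    rows, cols = linear_sum_assignment(-sim)
    # InfoNCE per matched pair: the assigned relation is the positive,
    # all other relations in the set act as negatives.
    logits = sim[rows]                                   # (N, M)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(rows)), cols].mean())
```

For example, a set of queries scored against its own (permuted) embeddings yields a low loss regardless of order, while unrelated relation embeddings yield a higher one, which is the behavior a set-alignment objective needs.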