FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

📅 2025-07-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing video question answering (VQA) methods rely heavily on event-centric question-answer pairs and fail to explicitly model fundamental scene elements such as object categories, spatial configurations, and descriptive attributes, which limits generalization and higher-order reasoning. To address this, FIQ first generates structured questions from descriptions extracted from videos, explicitly covering these foundational scene aspects, and second introduces a VQ-CAlign module that integrates task-specific question embeddings with visual features so that essential domain-specific details are preserved. Crucially, FIQ requires no additional human annotation, yet it enhances low-level scene understanding and contextual modeling. Evaluated on SUTD-TrafficQA, FIQ achieves state-of-the-art performance over existing baselines, supporting the view that grounding video QA in foundational, descriptive questions is effective for robust scene comprehension and reasoning.

📝 Abstract
Video question answering (VQA) is a multimodal task that requires interpreting a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is insufficient to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation, limiting its capacity for generalization and higher-level reasoning. In this paper, we propose fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing its fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. The generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that aligns task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase adaptability to downstream tasks. Experiments on SUTD-TrafficQA demonstrate that FIQ achieves state-of-the-art performance compared to existing baseline methods.
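
The abstract describes VQ-CAlign only at a high level. One common way to integrate question embeddings with frame-level visual features is cross-attention followed by a residual connection, and the PyTorch sketch below illustrates that pattern. The class name, tensor dimensions, and the choice of cross-attention are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a question-to-visual fusion module, in the spirit of
# the VQ-CAlign module described in the abstract.
# NOTE: class name, dimensions, and the use of cross-attention are
# illustrative assumptions; the paper's actual design may differ.
import torch
import torch.nn as nn


class VQCAlignSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Question tokens attend over per-frame visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # question: (B, Lq, dim) token embeddings of the question
        # visual:   (B, T,  dim) frame-level visual features
        attended, _ = self.cross_attn(query=question, key=visual, value=visual)
        # The residual keeps the original question semantics, so
        # domain-specific details are preserved (cf. the abstract).
        return self.norm(question + attended)


# Usage: fuse a batch of 2 questions (12 tokens) with 16 video frames.
q = torch.randn(2, 12, 512)
v = torch.randn(2, 16, 512)
fused = VQCAlignSketch()(q, v)
print(fused.shape)  # torch.Size([2, 12, 512])
```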
Problem

Research questions and friction points this paper is trying to address.

Enhance video understanding beyond event-centric annotations
Generate Q&A pairs to capture fundamental scene details
Improve model generalization and reasoning in VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates Q&A pairs from video descriptions (see the sketch after this list)
Integrates question embeddings with visual features
Enhances model reasoning and generalizability
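
To make the first bullet concrete, the sketch below shows one simple, rule-based way to turn scene elements extracted from a caption into fundamental Q&A pairs about object types and descriptive attributes. The function name, its inputs, and the question templates are hypothetical; this summary does not specify the paper's actual generation pipeline, which may well use a language model instead.

```python
# Illustrative, rule-based sketch of generating "fundamental" Q&A pairs
# from a video caption. NOTE: this template approach is an assumption
# for illustration, not the paper's method.
from typing import Dict, List, Tuple


def generate_fundamental_qa(objects: List[str],
                            colors: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn extracted scene elements into simple Q&A pairs."""
    qa_pairs = []
    # Object-category questions.
    for obj in objects:
        qa_pairs.append((f"Is there a {obj} in the video?", "yes"))
    # Descriptive-attribute questions (color, as one example attribute).
    for obj, color in colors.items():
        qa_pairs.append((f"What color is the {obj}?", color))
    return qa_pairs


# Usage with a toy caption; in practice the objects and attributes would
# be extracted automatically (e.g., by a parser or a language model).
caption = "A white sedan passes a pedestrian at a crosswalk."
pairs = generate_fundamental_qa(objects=["sedan", "pedestrian"],
                                colors={"sedan": "white"})
for question, answer in pairs:
    print(question, "->", answer)
```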
Authors

Ju-Young Oh
Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea
Ho-Joong Kim
Korea University
Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea