FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

📅 2025-07-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing video question answering (VQA) methods rely heavily on event-centric question-answer pairs and fail to explicitly model fundamental scene elements such as object categories, spatial configurations, and descriptive attributes, which limits generalization and higher-order reasoning. To address this, FIQ first generates structured questions from descriptions extracted from videos, explicitly covering these foundational scene aspects, and second introduces a VQ-CAlign module that integrates task-specific question embeddings with visual features so that essential domain-specific details are preserved. Crucially, FIQ requires no additional human annotation, yet it enhances low-level scene understanding and contextual modeling. Evaluated on SUTD-TrafficQA, FIQ achieves state-of-the-art performance over existing baselines, supporting the view that grounding video QA in foundational, descriptive questions is effective for robust scene comprehension and reasoning.

📝 Abstract
Video question answering (VQA) is a multimodal task that requires interpreting a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is insufficient to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation, limiting its capacity for generalization and higher-level reasoning. In this paper, we propose fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing its fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. The generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that aligns task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase adaptability to downstream tasks. Experiments on SUTD-TrafficQA demonstrate that FIQ achieves state-of-the-art performance compared to existing baseline methods.
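
The abstract describes VQ-CAlign only at a high level. One common way to integrate question embeddings with frame-level visual features is cross-attention followed by a residual connection, and the PyTorch sketch below illustrates that pattern. The class name, tensor dimensions, and the choice of cross-attention are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a question-to-visual fusion module, in the spirit of
# the VQ-CAlign module described in the abstract.
# NOTE: class name, dimensions, and the use of cross-attention are
# illustrative assumptions; the paper's actual design may differ.
import torch
import torch.nn as nn


class VQCAlignSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Question tokens attend over per-frame visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # question: (B, Lq, dim) token embeddings of the question
        # visual:   (B, T,  dim) frame-level visual features
        attended, _ = self.cross_attn(query=question, key=visual, value=visual)
        # The residual keeps the original question semantics, so
        # domain-specific details are preserved (cf. the abstract).
        return self.norm(question + attended)


# Usage: fuse a batch of 2 questions (12 tokens) with 16 video frames.
q = torch.randn(2, 12, 512)
v = torch.randn(2, 16, 512)
fused = VQCAlignSketch()(q, v)
print(fused.shape)  # torch.Size([2, 12, 512])
```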
Problem

Research questions and friction points this paper is trying to address.

Enhance video understanding beyond event-centric annotations
Generate Q&A pairs to capture fundamental scene details
Improve model generalization and reasoning in VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates Q&A pairs from video descriptions (see the sketch after this list)
Integrates question embeddings with visual features
Enhances model reasoning and generalizability
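
To make the first bullet concrete, the sketch below shows one simple, rule-based way to turn scene elements extracted from a caption into fundamental Q&A pairs about object types and descriptive attributes. The function name, its inputs, and the question templates are hypothetical; this summary does not specify the paper's actual generation pipeline, which may well use a language model instead.

```python
# Illustrative, rule-based sketch of generating "fundamental" Q&A pairs
# from a video caption. NOTE: this template approach is an assumption
# for illustration, not the paper's method.
from typing import Dict, List, Tuple


def generate_fundamental_qa(objects: List[str],
                            colors: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn extracted scene elements into simple Q&A pairs."""
    qa_pairs = []
    # Object-category questions.
    for obj in objects:
        qa_pairs.append((f"Is there a {obj} in the video?", "yes"))
    # Descriptive-attribute questions (color, as one example attribute).
    for obj, color in colors.items():
        qa_pairs.append((f"What color is the {obj}?", color))
    return qa_pairs


# Usage with a toy caption; in practice the objects and attributes would
# be extracted automatically (e.g., by a parser or a language model).
caption = "A white sedan passes a pedestrian at a crosswalk."
pairs = generate_fundamental_qa(objects=["sedan", "pedestrian"],
                                colors={"sedan": "white"})
for question, answer in pairs:
    print(question, "->", answer)
```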
Authors

Ju-Young Oh
Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea
Ho-Joong Kim
Korea University
Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea