Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VQA models are highly susceptible to training data biases, over-relying on superficial statistical correlations and thus exhibiting weak vision-language joint reasoning and poor generalization. To address this, we propose IOG-VQA—a novel framework that jointly integrates object-interaction self-attention with a GAN-driven debiasing module. The former explicitly models fine-grained spatial and semantic interactions among visual objects to enrich contextual visual representations; the latter performs adversarial distribution alignment in feature space to mitigate question-answer co-occurrence bias. Crucially, IOG-VQA enables end-to-end joint optimization of vision-language feature alignment. Evaluated on VQA-CP v1 and v2—benchmarks designed for out-of-distribution generalization—IOG-VQA achieves state-of-the-art performance, demonstrating substantial robustness gains under skewed data distributions. Our results empirically validate that co-modeling object interactions and data bias mitigation significantly enhances VQA reasoning capability.

📝 Abstract
Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model performs well compared with existing methods, particularly in handling biased and imbalanced data distributions. These results highlight the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
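
The abstract describes self-attention over the objects in an image but does not give the formulation. As a minimal sketch of the general idea, and not the authors' implementation, the following applies standard scaled dot-product self-attention to a list of object feature vectors. The function names, the identity query/key/value projections, and the toy feature values are all assumptions made for illustration; a trained model would use learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_self_attention(objects):
    """Scaled dot-product self-attention over object feature vectors.

    Assumption: queries, keys, and values are the raw features themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    dim = len(objects[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for query in objects:
        # Similarity of this object to every object, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in objects]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all object features.
        attended.append(
            [sum(w * obj[d] for w, obj in zip(row, objects)) for d in range(dim)]
        )
    return attended, weights


# Hypothetical 4-dimensional features for three detected objects.
features = [
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.9, 0.1],
    [0.0, 1.0, 0.0, 1.0],
]
attended, weights = object_self_attention(features)
```

In this toy input the first two objects have similar features, so the first object attends more strongly to the second than to the third. Each row of `weights` sums to one, and each row of `attended` is a context-enriched version of the corresponding object.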

Problem

Research questions and friction points this paper is trying to address.

Addressing biases in VQA models from training data
Capturing complex object interactions within visual content
Improving generalization across diverse questions and images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object Interaction Self-Attention captures complex image object relationships
GAN-Based Debiasing framework generates unbiased data distributions
Integration combines visual and textual information to reduce biases
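
The page does not specify the adversarial objective behind the GAN-based debiasing. As a hedged illustration only, the sketch below computes the standard GAN losses (discriminator loss and non-saturating generator loss) for a logistic discriminator that tries to separate features drawn from a reference distribution from features produced by the model. The names `reference_feats` and `model_feats`, the linear discriminator, and all numeric values are hypothetical and are not taken from IOG-VQA.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(feature, weights, bias):
    """Logistic discriminator: probability that a feature comes from the
    reference distribution (hypothetical linear model)."""
    return sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)


def gan_losses(reference_feats, model_feats, weights, bias, eps=1e-12):
    """Return (discriminator_loss, generator_loss) for one batch.

    Discriminator: -[mean log D(ref) + mean log(1 - D(model))]
    Generator (non-saturating): -mean log D(model)
    """
    d_ref = [discriminator_prob(f, weights, bias) for f in reference_feats]
    d_model = [discriminator_prob(f, weights, bias) for f in model_feats]
    d_loss = -(
        sum(math.log(p + eps) for p in d_ref) / len(d_ref)
        + sum(math.log(1.0 - p + eps) for p in d_model) / len(d_model)
    )
    g_loss = -sum(math.log(p + eps) for p in d_model) / len(d_model)
    return d_loss, g_loss


# Toy 2-dimensional features; all values are illustrative assumptions.
reference_feats = [[1.0, 0.0], [0.9, 0.1]]
model_feats = [[0.0, 1.0], [0.1, 0.9]]
disc_weights, disc_bias = [2.0, -2.0], 0.0

d_loss, g_loss = gan_losses(reference_feats, model_feats, disc_weights, disc_bias)
# If the model's features matched the reference ones, the generator loss drops.
_, aligned_g_loss = gan_losses(reference_feats, reference_feats, disc_weights, disc_bias)
```

When the discriminator separates the two sets easily, the generator loss is high; it falls as the model's features move toward the reference distribution, which is the sense in which adversarial training can align feature distributions.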
Zhifei Li
Research Scientist at Google
machine translation, natural language processing, machine learning, wireless networks
Feng Qiu
Argonne National Laboratory
Mathematical programming, optimization, power systems, energy systems
Yiran Wang
School of Computer Science, Hubei University, Wuhan 430062, China, Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China, and Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China
Yujing Xia
School of Computer Science, Hubei University, Wuhan 430062, China, Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China, and Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China
Kui Xiao
School of Computer Science, Hubei University, Wuhan 430062, China, Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China, and Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China
Miao Zhang
School of Computer Science, Hubei University, Wuhan 430062, China, Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China, and Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China
Yan Zhang
School of Computer Science, Hubei University, Wuhan 430062, China, Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China, and Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China