GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient modeling of human-object interactions in video question answering (Video QA), this paper proposes a human-centric Video QA framework. Methodologically, it constructs a cross-frame global scene graph rooted at humans to explicitly model spatiotemporally consistent human-object interactions; employs a graph neural network (GNN) to learn context-aware node embeddings; and integrates a hierarchical feature fusion network for multi-granularity semantic alignment and reasoning. The key contribution lies in the first deep integration of structured scene graphs with GNNs for video-level temporal modeling—uniquely balancing interpretability and fine-grained relational reasoning. On the AGQA benchmark, the approach achieves a 7.3% improvement in object-relation reasoning accuracy over prior state-of-the-art methods.

📝 Abstract
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph, and human nodes across frames are linked to a global root, forming a video-level graph that enables cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), which transform them into rich, context-aware node embeddings. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure improves interpretability by decomposing actions into human-object interactions and enables a deeper understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
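The human-rooted graph construction described above can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the frame format, node names, and the single mean-aggregation step standing in for a GNN layer are all assumptions for demonstration.

```python
# Illustrative sketch (not the paper's code): build a video-level graph by
# linking each frame's human node to a shared global root, then run one
# round of mean-aggregation message passing as a simplified GNN layer.

# Hypothetical per-frame scene graphs: a human node plus (relation, object) edges.
frames = [
    {"human": "person", "edges": [("holding", "cup"), ("near", "table")]},
    {"human": "person", "edges": [("drinking_from", "cup")]},
]

def build_video_graph(frames):
    """Return an undirected adjacency dict where every frame's human node
    is attached to a global root, so information can flow across frames
    through the human-centric backbone."""
    adj = {"root": set()}
    for t, fr in enumerate(frames):
        h = f"human_t{t}"
        adj["root"].add(h)
        adj.setdefault(h, set()).add("root")
        for _rel, obj in fr["edges"]:
            o = f"{obj}_t{t}"
            adj[h].add(o)
            adj.setdefault(o, set()).add(h)
    return adj

def message_pass(adj, feats):
    """One simplified GNN step: each node averages its own scalar feature
    with its neighbors' features (mean aggregation)."""
    out = {}
    for node, nbrs in adj.items():
        vals = [feats[node]] + [feats[n] for n in nbrs]
        out[node] = sum(vals) / len(vals)
    return out

adj = build_video_graph(frames)
feats = {n: float(i) for i, n in enumerate(sorted(adj))}  # toy scalar features
updated = message_pass(adj, feats)
```

After one step, the root's embedding already mixes information from the human nodes of both frames, which is the cross-frame effect the paper's human-rooted design is aiming for (here in drastically simplified scalar form).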
Problem

Research questions and friction points this paper is trying to address.

Capturing intricate human-object interactions in video sequences for question answering
Enabling cross-frame reasoning centered on human actors using scene graphs
Enhancing interpretability by decomposing actions into human-object interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene graphs capture human-object interactions in videos
Graph Neural Networks process video-level relational graphs
Hierarchical network integrates embeddings with question features
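The hierarchical fusion idea in the last bullet can be sketched as pooling node embeddings at two abstraction levels (per-frame and whole-video) and aligning each pooled vector with a question embedding. The function names, toy dimensions, and the elementwise-product fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of hierarchical feature fusion: combine graph
# embeddings with a question embedding at local (frame) and global
# (video) levels of abstraction.

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fuse(graph_vec, question_vec):
    """Elementwise product as a simple cross-modal alignment."""
    return [g * q for g, q in zip(graph_vec, question_vec)]

# Toy node embeddings grouped by frame (2 frames, 2-dim embeddings).
frame_nodes = [
    [[1.0, 0.0], [0.0, 1.0]],   # frame 0
    [[2.0, 2.0]],               # frame 1
]
question = [0.5, 2.0]           # toy question embedding

# Local level: fuse each frame's pooled embedding with the question.
local = [fuse(mean_pool(nodes), question) for nodes in frame_nodes]
# Global level: fuse the whole-video pooled embedding with the question.
global_vec = fuse(mean_pool([v for ns in frame_nodes for v in ns]), question)
# Combine both abstraction levels into one answer-side feature.
answer_feat = mean_pool(local + [global_vec])
```

The point of the hierarchy is that `local` preserves frame-specific interaction detail while `global_vec` summarizes the whole clip; combining both lets question reasoning use either granularity.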
Dionysia Danai Brilli
School of ECE, National Technical University of Athens, Athens, Greece
Dimitrios Mallis
University of Luxembourg, Kirchberg, Luxembourg
Vassilis Pitsikalis
deeplab.ai, Athens, Greece
Petros Maragos
Professor of Electrical and Computer Engineering, National Technical University of Athens
Research interests: computer vision, signal processing, speech & language, machine learning, robotics