GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

📅 2025-11-25

🤖 AI Summary
To address insufficient modeling of human-object interactions in video question answering (Video QA), this paper proposes GHR-VQA, a human-centric Video QA framework. The method represents each frame as a scene graph and links the human nodes across frames to a global root, forming a video-level graph that explicitly models human-object interactions over time. A graph neural network (GNN) then learns context-aware node embeddings, and a hierarchical network fuses these embeddings with question features at several levels of abstraction. The key contribution is the combination of structured, human-rooted scene graphs with GNNs for video-level reasoning, which supports both interpretability and fine-grained relational reasoning. On the AGQA benchmark, the approach reports a 7.3% improvement in object-relation reasoning over the prior state of the art.

📝 Abstract
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
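
The graph construction described in the abstract can be illustrated with a minimal sketch. It assumes each frame arrives as a list of (subject, relation, object) triples and that humans carry the label "person"; the names `VideoGraph`, `build_video_graph`, and the `root_to_human` edge type are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    # Each node is (frame_index, label); the global root uses frame_index -1.
    nodes: list = field(default_factory=list)
    # Each edge is (source_node_id, target_node_id, relation).
    edges: list = field(default_factory=list)


def build_video_graph(frames, human_label="person"):
    """Merge per-frame scene graphs into one graph with a global root.

    `frames` is a list with one entry per frame; each entry is a list of
    (subject, relation, object) triples. This input format is an assumption.
    """
    graph = VideoGraph()
    graph.nodes.append((-1, "ROOT"))
    root = 0
    for t, triples in enumerate(frames):
        local = {}  # label -> node id, scoped to the current frame
        for subj, rel, obj in triples:
            for label in (subj, obj):
                if label not in local:
                    local[label] = len(graph.nodes)
                    graph.nodes.append((t, label))
                    # Link every per-frame human node to the global root.
                    if label == human_label:
                        graph.edges.append((root, local[label], "root_to_human"))
            graph.edges.append((local[subj], local[obj], rel))
    return graph


frames = [
    [("person", "holding", "cup")],
    [("person", "drinking_from", "cup"), ("person", "sitting_on", "sofa")],
]
g = build_video_graph(frames)
```

In this toy input the two "person" nodes belong to different frames, and the root is the only path between them, which is what makes cross-frame reasoning centered on the human possible.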

Problem

Research questions and friction points this paper is trying to address.

Capturing intricate human-object interactions in video sequences for question answering
Enabling cross-frame reasoning centered on human actors using scene graphs
Enhancing interpretability by decomposing actions into human-object interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene graphs capture human-object interactions in videos
Graph Neural Networks process video-level relational graphs
Hierarchical network integrates embeddings with question features
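
To make the role of the global root concrete, here is a toy message-passing step. The source does not specify which GNN variant is used, so this sketch assumes plain mean aggregation over an undirected view of the edges, with no learned weights; `message_passing_step` is a hypothetical helper, not the paper's layer.

```python
def message_passing_step(features, edges):
    """One round of mean aggregation: each node averages itself and its neighbors."""
    n = len(features)
    dim = len(features[0])
    neighbors = [[] for _ in range(n)]
    for src, dst in edges:
        # Treat edges as undirected so information flows both ways.
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    updated = []
    for i in range(n):
        group = [features[i]] + [features[j] for j in neighbors[i]]
        updated.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return updated


# Node 0 is the global root; nodes 1 and 2 are the human in frames 0 and 1.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (0, 2)]

h1 = message_passing_step(features, edges)
h2 = message_passing_step(h1, edges)
```

After one round the frame-0 human has seen only the root, so its second feature is still zero. After the second round it becomes positive, because the frame-1 human's features have travelled through the root. A learned GNN layer would add weight matrices and nonlinearities, but the propagation pattern is the same.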