🤖 AI Summary
To address insufficient modeling of human-object interactions in video question answering (Video QA), this paper proposes a human-centric Video QA framework. Methodologically, it constructs a cross-frame global scene graph rooted at human nodes to explicitly model spatiotemporally consistent human-object interactions; employs a graph neural network (GNN) to learn context-aware node embeddings; and integrates a hierarchical feature fusion network for multi-granularity semantic alignment and reasoning. The key contribution is the deep integration of structured scene graphs with GNNs for video-level temporal modeling, balancing interpretability with fine-grained relational reasoning. On the AGQA benchmark, the approach achieves a 7.3% improvement in object-relation reasoning accuracy over prior state-of-the-art methods.
📝 Abstract
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
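The human-rooted video-level graph described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: per-frame scene graphs are given as (subject, relation, object) triples, every human node is linked to a single global root, and one round of mean-aggregation message passing stands in for the GNN. The two-frame example video, the feature dimension, and all names are assumptions made for the sketch.

```python
# Hypothetical sketch of GHR-VQA's video-level graph construction
# (not the paper's actual code). Per-frame scene graphs are merged,
# human nodes are attached to a global root for cross-frame reasoning,
# and a simple mean-aggregation step stands in for the GNN.

from collections import defaultdict

def build_video_graph(frame_graphs):
    """Merge per-frame scene graphs into one video-level graph,
    linking each frame's human node to a shared global root."""
    edges = []
    root = "root"
    for t, triples in enumerate(frame_graphs):
        for subj, rel, obj in triples:  # (subject, relation, object)
            edges.append((f"{subj}@{t}", f"{obj}@{t}"))
            if subj == "human":
                edges.append((root, f"{subj}@{t}"))  # human-rooted link
    return edges

def message_pass(edges, feats):
    """One mean-aggregation message-passing step (GNN stand-in):
    each node's new feature is the mean over itself and its neighbors."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = {}
    for node, x in feats.items():
        group = [feats[n] for n in nbrs[node] if n in feats] + [x]
        out[node] = [sum(c) / len(group) for c in zip(*group)]
    return out

# Toy two-frame video: a person holds a cup, then puts it on a table.
frames = [
    [("human", "holding", "cup")],
    [("human", "putting_on", "cup"), ("cup", "on", "table")],
]
edges = build_video_graph(frames)
feats = {n: [1.0, 0.0] for e in edges for n in e}
feats["root"] = [0.0, 1.0]  # distinct root feature
updated = message_pass(edges, feats)
```

After one step, the root node has aggregated information from the human nodes of both frames, which is what enables reasoning centered on the human actor across time; the real model would instead use learned GNN layers and fuse the resulting embeddings with question features hierarchically.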