🤖 AI Summary
Egocentric visual question answering (VQA) on HD-EPIC is challenging: it demands fine-grained understanding of hand-object interactions and reasoning across time and space in first-person videos.
Method: We propose SceneNet-KnowledgeNet, a dual-graph collaborative framework: SceneNet leverages multimodal large models to generate structured scene graphs explicitly encoding objects, actions, spatial, and temporal relations; KnowledgeNet integrates ConceptNet commonsense knowledge to construct semantically enriched graphs. Both graphs are jointly optimized via graph neural networks to enable interpretable, cross-level alignment between visual representations and commonsense reasoning.
Contribution/Results: This work overcomes the limitations of unimodal representations and is the first to synergistically model scene graphs and knowledge graphs for egocentric VQA. Evaluated on the seven complex tasks of the HD-EPIC VQA Challenge 2025, the method achieves 44.21% accuracy, substantially outperforming the baselines, and demonstrates the effectiveness and generalizability of graph-structured joint representations for difficult first-person reasoning.
📝 Abstract
This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multimodal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and combining them within our framework yields an overall accuracy of 44.21% on the challenge, highlighting the effectiveness of the combined framework for complex egocentric VQA tasks.
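To make the dual-graph idea concrete, here is a minimal illustrative sketch (not the authors' code; all entities, relations, and the alignment-by-shared-entity mechanism are hypothetical stand-ins) of how a video-derived scene graph and a ConceptNet-style knowledge graph can be joined on shared entity names to pool visual and commonsense evidence for a question entity:

```python
# Hypothetical dual-graph lookup: visual scene-graph triples plus
# ConceptNet-style commonsense triples, joined on shared entity names.

# Scene graph: (subject, relation, object) triples an MLLM might emit
# from an egocentric kitchen clip (invented example data).
scene_graph = [
    ("hand", "holds", "knife"),
    ("knife", "cuts", "onion"),
    ("onion", "on", "cutting_board"),
    ("cutting_board", "left_of", "stove"),
]

# Knowledge graph: ConceptNet-style commonsense edges (invented).
knowledge_graph = [
    ("knife", "UsedFor", "cutting"),
    ("onion", "IsA", "vegetable"),
    ("stove", "UsedFor", "cooking"),
]

def neighbors(entity, triples):
    """All (relation, other_entity) pairs touching `entity`."""
    out = {(r, o) for s, r, o in triples if s == entity}
    out |= {(r, s) for s, r, o in triples if o == entity}
    return out

def dual_graph_context(entity):
    """Pool visual (scene) and commonsense (knowledge) evidence for one
    entity -- the cross-level alignment idea from the summary, reduced
    to a simple join on shared entity names."""
    return {
        "scene": neighbors(entity, scene_graph),
        "knowledge": neighbors(entity, knowledge_graph),
    }

ctx = dual_graph_context("onion")
print(sorted(ctx["scene"]))      # visual: cut by knife, on cutting board
print(sorted(ctx["knowledge"]))  # commonsense: onion is a vegetable
```

In the actual system this alignment is learned jointly by graph neural networks over both graphs; the dictionary union above only illustrates the kind of cross-graph evidence such a model can draw on.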