From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Egocentric visual question answering (VQA) in HD-EPIC poses significant challenges in fine-grained interaction understanding and cross-temporal-spatial reasoning from first-person videos. Method: We propose SceneNet-KnowledgeNet, a dual-graph collaborative framework: SceneNet leverages multimodal large models to generate structured scene graphs explicitly encoding objects, actions, spatial, and temporal relations; KnowledgeNet integrates ConceptNet commonsense knowledge to construct semantically enriched graphs. Both graphs are jointly optimized via graph neural networks to enable interpretable, cross-level alignment between visual representations and commonsense reasoning. Contribution/Results: This work overcomes unimodal limitations and is the first to synergistically model scene graphs and knowledge graphs for egocentric VQA. Evaluated on seven complex tasks in HD-EPIC 2025, our method achieves 44.21% accuracy—substantially outperforming baselines—and demonstrates the effectiveness and generalizability of graph-structured joint representation for high-difficulty first-person reasoning.

Technology Category

Application Category

📝 Abstract
This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.
Problem

Research questions and friction points this paper is trying to address.

Leveraging scene graphs for fine-grained object interactions and spatial relationships
Incorporating external commonsense knowledge for high-level semantic connections
Improving accuracy in complex egocentric visual question answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

SceneNet uses MLLM for scene graph generation
KnowledgeNet integrates ConceptNet for semantic reasoning
Combining both boosts egocentric VQA accuracy
🔎 Similar Papers
No similar papers found.