Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos

📅 2025-06-01

🏛️ Expert Systems with Applications

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-person human-object interaction (HOI) recognition in videos. We propose a Geometry-Visual Graph Neural Network (GV-GNN) that jointly models 3D human pose geometry, visual appearance features, and spatiotemporal dynamics across persons and objects. Methodologically, we incorporate explicit 3D pose geometric priors into dynamic graph construction, design a cross-subject interaction attention mechanism, and combine multi-scale spatiotemporal convolutions with differentiable geometric graph pooling for fine-grained joint inference. On CAD-120, V-COCO, and HICO-DET, GV-GNN improves mAP by 3.2–5.7% and is more robust to occlusion and dense interactions. To our knowledge, this is the first work to systematically embed explicit 3D geometric priors into HOI graph modeling, establishing a multimodal paradigm for spatiotemporal interaction understanding.
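
As a rough illustration of what geometry-driven graph construction can look like, the sketch below builds edge weights from 3D entity positions and uses them to propagate fused geometric and visual node features. It is not the GV-GNN architecture: the Gaussian distance kernel, the `sigma` parameter, the concatenation-based fusion, and the function names `geometric_adjacency` and `fuse_and_propagate` are all assumptions made for this example, and the inputs are random placeholders rather than real pose or appearance data.

```python
import numpy as np


def geometric_adjacency(positions, sigma=1.0):
    """Return a row-normalized (N, N) affinity matrix from 3D positions.

    Assumption for illustration: closer entities get larger edge weights
    via a Gaussian kernel on squared Euclidean distance.
    """
    # Pairwise differences via broadcasting: (N, 1, 3) - (1, N, 3) -> (N, N, 3)
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    weights = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Normalize each row so every node's incoming weights sum to 1
    return weights / weights.sum(axis=1, keepdims=True)


def fuse_and_propagate(positions, visual, sigma=1.0):
    """One message-passing step over concatenated geometric and visual features."""
    adj = geometric_adjacency(positions, sigma)
    # Hypothetical fusion: simple concatenation of the two modalities
    nodes = np.concatenate([positions, visual], axis=1)
    # Each node becomes a distance-weighted average of all nodes (itself included)
    return adj @ nodes


# Placeholder data: 4 entities (persons or objects), 3D positions, 8-dim visual features
rng = np.random.default_rng(0)
positions = rng.normal(size=(4, 3))
visual = rng.normal(size=(4, 8))
out = fuse_and_propagate(positions, visual)
```

In a video setting, the adjacency would be recomputed per frame as entities move, which is one simple way to read "dynamic graph construction"; a learned attention mechanism could then replace or reweight these fixed geometric affinities.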