AI Summary
This work addresses the video-based multi-person–object interaction (HOI) recognition task. We propose a Geometry-Visual Graph Neural Network (GV-GNN) that jointly models 3D human pose geometry, visual appearance features, and spatiotemporal dynamics across persons and objects. Methodologically, we explicitly incorporate 3D pose geometric priors into dynamic graph construction, design a cross-subject interaction attention mechanism, and integrate multi-scale spatiotemporal convolutions with differentiable geometric graph pooling for fine-grained joint inference. On CAD-120, V-COCO, and HICO-DET, GV-GNN achieves consistent mAP improvements of 3.2–5.7%, significantly enhancing robustness to occlusion and dense interactions. To our knowledge, this is the first work to systematically embed explicit 3D geometric priors into HOI graph modeling, establishing a novel multimodal spatiotemporal interaction understanding paradigm.
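
To make the idea of geometry-driven dynamic graph construction concrete, here is a minimal sketch. It is not the GV-GNN implementation: the summary does not specify how edges are weighted, so the Gaussian distance kernel, the `sigma` parameter, the function name `geometric_adjacency`, and the toy coordinates are all assumptions chosen for illustration. Each node stands for a 3D point such as a body joint or an object center, and one adjacency matrix is built per frame so the graph changes as the scene moves.

```python
import math


def geometric_adjacency(points, sigma=1.0):
    """Build a row-normalized adjacency matrix from 3D node positions.

    Hypothetical helper: edge weights come from a Gaussian kernel on
    Euclidean distance, which is an assumed choice, not the paper's.
    """
    n = len(points)
    adjacency = []
    for i in range(n):
        # Closer nodes receive larger weights; a node's self-weight is 1.
        row = [
            math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        adjacency.append([w / total for w in row])
    return adjacency


# Toy sequence (made-up coordinates): node 0 is a person, node 1 is an
# object that moves, node 2 is a second person.
frames = [
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
]

# One graph per frame: the edge weights follow the 3D geometry over time.
dynamic_graph = [geometric_adjacency(frame) for frame in frames]
```

In this toy sequence the object is most strongly connected to the first person in frame 0 and to the second person in frame 1, which is the kind of geometry-dependent connectivity a learned attention mechanism could then refine.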