🤖 AI Summary
To address the lack of object-interaction semantics in skeleton-based action recognition, this paper proposes an object-aware, end-to-end action understanding framework. The core method introduces learnable, optimizable object representations to explicitly model spatio-temporal associations between human joints and scene objects. It incorporates an attention-driven interaction modeling mechanism, a multimodal feature fusion module, and a differentiable object-relation reasoning module, marking the first integration of explicit object information into graph convolutional network (GCN)-based skeleton action recognition architectures. Evaluated on the NTU-60 and NTU-120 benchmarks, the approach achieves absolute accuracy improvements of 2.3% and 1.9%, respectively, significantly outperforming skeleton-only baselines. These results empirically validate the contribution of object-interaction semantics to action recognition performance.
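The attention-driven joint-object interaction described above can be sketched in a few lines: each skeleton joint attends over a set of learnable object slots and fuses the resulting object context back into the skeleton stream. This is a minimal NumPy sketch under stated assumptions; the dimensions (25 joints, 4 object slots, 64-d features), the scaled dot-product attention, and the residual fusion are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: 25 joints, 4 object slots, 64-d features.
num_joints, num_objects, dim = 25, 4, 64

joint_feats = rng.standard_normal((num_joints, dim))    # per-joint skeleton (e.g. GCN) features
object_embeds = rng.standard_normal((num_objects, dim)) # learnable object representations

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Attention-driven interaction: each joint attends over the object slots.
scores = joint_feats @ object_embeds.T / np.sqrt(dim)   # (25, 4) joint-object affinities
attn = softmax(scores, axis=-1)                         # each row sums to 1
object_context = attn @ object_embeds                   # (25, 64) per-joint object context

# Simple multimodal fusion: residual add of object context onto the skeleton stream.
fused = joint_feats + object_context
print(fused.shape)
```

In a trainable version, `object_embeds` would be model parameters updated by backpropagation, which is what makes the object representations "learnable and optimizable" as the summary states.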