🤖 AI Summary
Existing methods for human action and dynamic scene recognition in videos often suffer from insufficient spatiotemporal modeling, excessive parameter counts, and high computational overhead. To address these issues, we propose 3DPyraNet, a lightweight 3D pyramid neural network incorporating a biologically inspired weight mechanism to jointly model spatial topology and temporal dynamics while avoiding the parameter explosion caused by fully connected layers. Furthermore, we introduce the 3DPyraNet-F feature fusion strategy, enabling high-level semantic alignment and complementary enhancement across multi-frame spatiotemporal feature maps. Equipped with a linear SVM classifier, our model achieves significant improvements over state-of-the-art methods on UCF101, HMDB51, and Something-Something V2, attains competitive performance on Kinetics-400, and demonstrates strong robustness to camera motion.
📄 Abstract
A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While CNNs have proven effective in recognizing handwritten characters, traffic signs, and similar patterns, their deep variants have recently proven effective in these as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly sacrificing the main advantage of the CNN, namely its small parameter count. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme that learns features from both the spatial and temporal dimensions by analyzing multiple adjacent frames while keeping a biologically plausible structure. It preserves the spatial topology of the input image and has fewer parameters and lower computational and memory costs than both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and feeds this vector to a linear-SVM classifier, enhancing the recognition of human actions and dynamic scenes in videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Furthermore, 3DPyraNet-F clearly outperforms the state of the art on three benchmark datasets and shows comparable results on the fourth.
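The 3DPyraNet-F pipeline described above (take the top-layer feature maps of the learned network, fuse them into one vector, hand that vector to a linear SVM) can be sketched as follows. The abstract does not specify the exact fusion operator, so flatten-and-concatenate is an assumption here, and `fuse_feature_maps` is a hypothetical helper name.

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    """Fuse the highest-layer feature maps of a video clip into one vector.

    feature_maps: list of 2-D arrays, one map per temporal position.
    Fusion by flattening and concatenating is an assumption; the paper's
    3DPyraNet-F may define a different fusion operator.
    """
    return np.concatenate([m.ravel() for m in feature_maps])

# Toy example: three 4x4 top-layer maps from adjacent frames
maps = [np.full((4, 4), i, dtype=np.float32) for i in range(3)]
vec = fuse_feature_maps(maps)
# vec has 3 * 4 * 4 = 48 entries; in 3DPyraNet-F such vectors (one per
# clip) would be the training/testing inputs of a linear SVM classifier.
```

In practice the fused vectors for all training clips would be stacked into a matrix and passed to any off-the-shelf linear SVM implementation.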