🤖 AI Summary
Existing 3D video action recognition models suffer from excessive computational overhead, hindering deployment on resource-constrained edge devices. Method: We propose an efficient edge-oriented architecture: a lightweight 3D network that integrates depthwise separable 3D convolutions with linear-complexity attention to capture long-range temporal dependencies, complemented by joint channel and spatiotemporal pruning to reduce parameters and FLOPs. We further design a customized quantization scheme that ensures training stability and inference efficiency, supporting low-bit weight/activation co-optimization and hardware-friendly integer-only inference. Results: On the Kinetics-400 and Something-Something V2 benchmarks, our method achieves comparable or superior accuracy at less than 30% of the computational cost of mainstream 3D models, enabling real-time applications such as smart-home surveillance and remote healthcare behavior analysis.
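The linear-complexity attention mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a common linear-attention formulation in which a positive feature map φ replaces the softmax, so key/value statistics are accumulated once (O(n) in sequence length) instead of forming an n×n attention matrix:

```python
import math

def phi(x):
    # Positive feature map (assumption: ELU(x) + 1, a common linear-attention choice).
    return [math.exp(v) if v < 0 else v + 1.0 for v in x]

def linear_attention(Q, K, V):
    """O(n) attention sketch: accumulate S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j) in one pass over keys/values, then each query reads
    out_i = (phi(q_i) @ S) / (phi(q_i) . z)."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):          # single pass over the sequence
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:                     # each query is O(d * dv), independent of n
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out
```

Because the weights are positive and normalized, each output row is a convex combination of the value rows, mirroring the averaging behavior of softmax attention at linear cost.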
📝 Abstract
In this paper, we introduce a deep learning solution for video activity recognition that combines convolutional layers with a linear-complexity attention mechanism. We further propose a novel quantization mechanism that improves the efficiency of our model during both training and inference. Our model maintains a low computational cost while preserving robust learning and generalization capabilities. Our approach addresses the high computing requirements of current models, with the goal of achieving competitive accuracy on consumer and edge devices, enabling smart-home and smart-healthcare applications where efficiency and privacy are primary concerns. We validate our model experimentally on several established, publicly available video activity recognition benchmarks, improving accuracy over alternative models at a competitive computational cost.
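To make the quantization idea concrete, the following is a minimal sketch of symmetric per-tensor fake quantization, the standard building block of quantization-aware training. It is illustrative only and is not the paper's customized scheme: values are scaled to a signed low-bit integer grid, rounded, clipped, and rescaled, so the rounding error seen at train time matches integer inference:

```python
def fake_quantize(x, num_bits=8):
    """Symmetric per-tensor fake quantization (illustrative sketch).

    Maps each value to the nearest point on a signed num_bits integer grid
    and back to floats, bounding the per-element error by scale / 2.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8-bit
    amax = max(abs(v) for v in x)
    if amax == 0.0:                          # all-zero tensor: nothing to quantize
        return list(x)
    scale = amax / qmax                      # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]
```

At inference time the same scale lets weights and activations be stored and multiplied as integers, which is what makes integer-only deployment on edge hardware possible.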