An Effective End-to-End Solution for Multimodal Action Recognition

📅 2025-06-11
🏛️ International Conference on Pattern Recognition
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance bottleneck in trimodal action recognition caused by data scarcity, this paper proposes an end-to-end lightweight solution. First, it adapts a pre-trained RGB model to the trimodal task via transfer learning and augments the small-scale training set with targeted data augmentation. Second, it jointly models efficient spatiotemporal features using a 2D CNN integrated with the Temporal Shift Module (TSM). Third, it enhances prediction robustness through Stochastic Weight Averaging (SWA), model ensembling, and Test-Time Augmentation (TTA). This work is the first to systematically integrate data expansion, lightweight spatiotemporal modeling, and multi-stage inference enhancement for trimodal action recognition—achieving high computational efficiency without compromising generalization. Experiments demonstrate state-of-the-art performance on the competition leaderboard, attaining 99% Top-1 and 100% Top-5 accuracy, significantly outperforming existing methods.
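The Temporal Shift Module mentioned in the summary works by shifting a fraction of feature channels forward and backward along the time axis, so a plain 2D CNN can exchange information between neighboring frames at essentially zero extra compute. A minimal NumPy sketch of the shift operation follows; the function name, the `shift_div` parameter, and the `(T, C, H, W)` tensor layout are illustrative assumptions, not the paper's code:

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """TSM-style channel shift (sketch): move 1/shift_div of the channels
    one step backward in time, another 1/shift_div one step forward, and
    leave the rest untouched. Boundary frames are zero-padded.
    x has shape (T, C, H, W): frames, channels, height, width."""
    t, c, h, w = x.shape
    fold = c // shift_div                     # channels per shift direction
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]            # group 1: shift backward in time
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # group 2: shift forward
    out[:, 2 * fold:] = x[:, 2 * fold:]       # remaining channels: no shift
    return out

# Toy example: 4 frames, 8 channels, 2x2 spatial grid
x = np.arange(4 * 8 * 2 * 2, dtype=float).reshape(4, 8, 2, 2)
y = temporal_shift(x, shift_div=4)            # fold = 2 channels per group
```

In the real TSM, this shift is inserted inside residual blocks of the 2D backbone, so each frame's convolution sees features from its temporal neighbors.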

📝 Abstract
Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition faces many challenges. To this end, we propose a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data are transformed and expanded through optimized data augmentation techniques to enlarge the training scale. At the same time, additional RGB datasets are used to pre-train the backbone network, which is then adapted to the new task via transfer learning. Second, multimodal spatial features are extracted with 2D CNNs and combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs while improving computational efficiency. In addition, common prediction enhancement methods, such as Stochastic Weight Averaging (SWA), ensembling, and Test-Time Augmentation (TTA), are used to integrate the knowledge of models from different training periods of the same architecture and from different architectures, so as to predict actions from different perspectives and fully exploit the target information. Ultimately, we achieved a Top-1 accuracy of 99% and a Top-5 accuracy of 100% on the competition leaderboard, demonstrating the superiority of our solution.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of tri-modal data for action recognition
Enhances multimodal spatial-temporal feature extraction efficiency
Improves accuracy using ensemble and prediction enhancement methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized data enhancement for training scale expansion
2D CNNs and TSM for spatial-temporal feature extraction
SWA, Ensemble, TTA for multi-perspective prediction
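The prediction-enhancement ideas listed above are simple to state: SWA averages model weights collected from different training epochs, while TTA (and ensembling) average prediction scores over multiple views or models. A minimal NumPy sketch under assumed names and data layouts (not the authors' implementation):

```python
import numpy as np

def swa_average(checkpoints):
    """SWA sketch: average each parameter array over checkpoints taken
    from several training epochs of the same model architecture."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

def tta_predict(model_fn, clip, augmentations):
    """TTA sketch: run the model on several augmented views of the same
    input clip and average the resulting score vectors."""
    scores = np.stack([model_fn(aug(clip)) for aug in augmentations])
    return scores.mean(axis=0)

# Toy usage: two checkpoints of a one-parameter "model", two views of a clip
avg = swa_average([{'w': np.array([1.0, 3.0])},
                   {'w': np.array([3.0, 5.0])}])   # -> {'w': [2., 4.]}
pred = tta_predict(lambda c: c,                     # identity "model"
                   np.array([1.0, 2.0]),
                   [lambda c: c, lambda c: c[::-1]])  # original + flipped view
```

Model ensembling follows the same averaging pattern as `tta_predict`, except the outer loop runs over different trained models instead of different augmented views.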
Songping Wang
School of Intelligence Science and Technology, Nanjing University, China
Haoxiang Rao
School of Intelligence Science and Technology, Nanjing University, China
Xiantao Hu
Nanjing University of Science & Technology
Computer Vision
Yueming Lyu
School of Intelligence Science and Technology, Nanjing University, China
Caifeng Shan
Philips Research
Computer Vision · Pattern Recognition · Machine Learning · Image/Video Analysis