Heatmap Pooling Network for Action Recognition from RGB Videos

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

To address information redundancy, noise sensitivity, and high storage overhead in RGB-based video action recognition, this paper proposes the Heatmap Pooling Network (HP-Net). HP-Net makes two main contributions: (1) a feedback pooling module that focuses on discriminative human body regions and yields compact, robust pooled features; and (2) a spatial-motion co-learning module combined with a text refinement modulation module, which fuse the pooled visual features with other modalities, including text-based semantic priors. The entire model is end-to-end trainable and supports multimodal joint optimization. The authors report that HP-Net outperforms existing methods on NTU RGB+D 60, NTU RGB+D 120, Toyota Smarthome, and UAV-Human. The source code is publicly available.


📝 Abstract

Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise, and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which uses a feedback pooling module to extract information-rich, robust, and concise pooled features of the human body. The extracted pooled features show clear performance advantages over the pose data and heatmap features previously obtained from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks, namely NTU RGB+D 60, NTU RGB+D 120, Toyota Smarthome, and UAV-Human, consistently verify the effectiveness of our HP-Net, which outperforms existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.

Problem

Research questions and friction points this paper is trying to address.

Extracting compact, robust pooled features of the human body from RGB videos

Integrating the pooled features with other multimodal data for action recognition

Reducing the redundancy, noise sensitivity, and storage cost of deep RGB video features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback pooling module extracts robust video features

Spatial-motion co-learning integrates multimodal data

Text refinement modulation enhances action recognition
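
To make the pooling idea concrete, here is a minimal sketch of generic heatmap-weighted pooling, in which each body-joint heatmap acts as a soft spatial mask over a feature map. This is an illustration under stated assumptions, not the paper's feedback pooling module: the function name `heatmap_pool`, the plain-list data layout, and the normalization by heatmap mass are all hypothetical choices, and the feedback mechanism itself is not modeled.

```python
from typing import List

Grid = List[List[float]]  # one H x W map (a feature channel or a joint heatmap)


def heatmap_pool(features: List[Grid], heatmaps: List[Grid],
                 eps: float = 1e-8) -> List[List[float]]:
    """Pool C feature channels with J joint heatmaps.

    Hypothetical helper: returns a J x C matrix in which row j is the
    heatmap-weighted average of every feature channel, so each joint
    is summarized by one C-dimensional vector.
    """
    pooled = []
    for hm in heatmaps:
        # Total heatmap mass, used to turn the weighted sum into an average.
        mass = sum(sum(row) for row in hm) + eps
        joint_vec = []
        for channel in features:
            acc = 0.0
            for hm_row, ch_row in zip(hm, channel):
                for weight, value in zip(hm_row, ch_row):
                    acc += weight * value
            joint_vec.append(acc / mass)
        pooled.append(joint_vec)
    return pooled


# Toy input: 2 channels and 2 joints on a 2 x 2 grid.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[10.0, 20.0], [30.0, 40.0]]]
heatmaps = [[[1.0, 0.0], [0.0, 0.0]],   # joint 0 peaks at the top-left cell
            [[0.0, 0.0], [1.0, 1.0]]]   # joint 1 covers the bottom row
pooled = heatmap_pool(features, heatmaps)
```

The output holds J x C values instead of C x H x W, which illustrates why features pooled this way are far more compact to store than dense feature maps. A practical version would run on batched tensors in a deep learning framework rather than nested lists.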

🔎 Similar Papers

Collaboratively Self-supervised Video Representation Learning for Action Recognition