Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

📅 2026-05-21

🤖 AI Summary

This work addresses the limited cross-domain generalization of zero-shot action recognition caused by variations in camera viewpoint and human body orientation. To mitigate this issue, the authors propose an orientation-aware multi-view motion encoding network that combines multi-view motion features with textual action descriptions. During training, the model constructs orientation-aligned action representations; at inference, it achieves semantic alignment by matching them against orientation-adapted text prompts. The approach improves generalization to unseen actions and cross-view scenarios. Experiments show that it outperforms recent state-of-the-art zero-shot methods on several benchmarks (NTU-RGB+D, BABEL, NW-UCLA, and two surveillance datasets) and that the learned representations transfer well to the recognition of seen actions.

📝 Abstract

Robustness to domain changes is a key capability for the effective deployment of human action recognition systems in real-world scenarios, where actions observed at inference can exhibit substantial domain shifts or may not have been seen during training at all. Improving the recognition capabilities of Zero-Shot Action Recognition (ZSAR) models without requiring heavy annotation effort therefore remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint introduce a significant domain gap, substantially limiting generalization to novel action-motion combinations. This paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues from multiple camera viewpoints with text descriptions of human actions during training. We present a new orientation-aware motion encoding network that learns different motion features, and we adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR
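
The inference step described in the abstract, matching motion features against orientation-aware text prompts, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template in `build_prompt`, the function names, and the three-dimensional toy vectors are all hypothetical stand-ins, and the sketch assumes that matching is done by cosine similarity in a shared embedding space, which the abstract does not state. In a real system the vectors would come from the motion encoder and a text encoder.

```python
import math


def build_prompt(action: str, orientation: str) -> str:
    # Hypothetical template: the abstract does not give the actual prompt wording.
    return f"a person {action}, body facing {orientation}"


def cosine(u, v):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(motion_embedding, text_embeddings):
    # text_embeddings maps each candidate action label to the embedding of its
    # orientation-adapted prompt. The label with the highest similarity wins.
    scores = {
        label: cosine(motion_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for encoder outputs (assumed values, for illustration only).
text_embeddings = {
    "waving": [0.9, 0.1, 0.0],
    "sitting down": [0.1, 0.9, 0.2],
    "kicking": [0.0, 0.2, 0.9],
}
motion_embedding = [0.2, 0.8, 0.3]

label, scores = zero_shot_classify(motion_embedding, text_embeddings)
print(build_prompt(label, "left"))
print(label)  # sitting down
```

Because classification only compares embeddings, new action labels can be added at inference by embedding their prompts, with no retraining of the matching step; that is the property that makes the setting zero-shot.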

Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Action Recognition

Cross-Domain

Domain Shift

Viewpoint Variation

Human Action Recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Action Recognition

Cross-Domain Generalization

Orientation-Aware Encoding