PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses upstream semantic loss in zero-shot skeleton-based action recognition (ZSSAR): skeletons are compressed outputs of human pose estimation (HPE), so human-object interaction cues and pose-relative visual semantics may no longer be explicit by the time skeleton-text alignment begins. To bridge this gap, the authors propose PoseBridge, an HPE-aware framework that feeds intermediate HPE representations into skeleton-text alignment. Using pose as an anchor to extract semantic cues, PoseBridge applies skeleton-conditioned bridging and semantic prototype adaptation to transfer this upstream knowledge into the alignment process, without an additional RGB action branch or object detector. PoseBridge improves ZSSAR performance under the evaluated protocols on NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, with the largest gains on the Kinetics-200/400 PURLS benchmark: 13.3 to 17.4 percentage points over the strongest compared baseline across all eight splits.

📝 Abstract

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
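The generic skeleton-text alignment setup that the abstract describes (encode skeletons, align with language, classify unseen actions) can be sketched as a nearest-prototype lookup in a shared embedding space. This is a minimal illustration of that baseline formulation only, not PoseBridge itself: the function names, the toy embeddings, and the use of cosine similarity are assumptions made for the example, and a real system would obtain the embeddings from trained skeleton and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(skeleton_emb, class_text_emb):
    """Assign each skeleton embedding to the most similar class text embedding.

    skeleton_emb:   (N, D) array, one embedding per skeleton sequence (hypothetical encoder output).
    class_text_emb: (C, D) array, one embedding per unseen class description.
    Returns (pred, sims): predicted class indices (N,) and cosine similarities (N, C).
    """
    s = l2_normalize(np.asarray(skeleton_emb, dtype=float))
    t = l2_normalize(np.asarray(class_text_emb, dtype=float))
    sims = s @ t.T
    return sims.argmax(axis=1), sims


# Toy data standing in for encoder outputs (assumed values, D = 4).
unseen_classes = ["drink water", "throw", "sit down"]
text_emb = np.eye(3, 4)
skel_emb = np.array([
    [0.1, 0.9, 0.0, 0.2],  # closest to the "throw" prototype
    [2.0, 0.1, 0.1, 0.0],  # closest to the "drink water" prototype
])

pred, sims = zero_shot_classify(skel_emb, text_emb)
predicted_labels = [unseen_classes[i] for i in pred]
print(predicted_labels)
```

In this formulation the classifier only ever sees what survives in the skeleton embedding, which is the point the abstract makes: cues dropped upstream during pose estimation cannot be recovered at the alignment stage.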

Problem

Research questions and friction points this paper is trying to address.

zero-shot skeleton-based action recognition

skeleton-text alignment

human pose estimation

semantic loss

action recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot action recognition

skeleton-based action recognition

human pose estimation