🤖 AI Summary
Skeleton-based action recognition has long suffered from scarce labeled data and from the difficulty of modeling both short- and long-range temporal dependencies. To address these issues, this paper proposes LSTC-MDA, a unified framework with two core components: (1) a Long-Short Term Temporal Convolution (LSTC) module whose parallel short- and long-term branches are adaptively aligned and fused using learned similarity weights; and (2) an extended Joint Mixing Data Augmentation (JMDA) scheme that adds input-level Additive Mixup and restricts mixing to samples from the same camera view to avoid distribution shift. On the NTU 60, NTU 120, and NW-UCLA benchmarks, LSTC-MDA achieves state-of-the-art accuracies of 94.1% (X-Sub) and 97.5% (X-View), 90.4% (X-Sub) and 92.0% (X-Set), and 97.2%, respectively.
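The view-restricted Additive Mixup can be sketched as follows. This is a minimal illustrative reading, not the paper's exact formulation: the function name, the Beta-distributed mixing coefficient, and the skip-on-cross-view behavior are assumptions for illustration.

```python
import numpy as np

def additive_mixup(x_a, x_b, view_a, view_b, alpha=1.0, rng=None):
    """Input-level Additive Mixup restricted to same-view pairs (sketch).

    x_a, x_b: skeleton tensors, e.g. shape (C, T, V) = (channels, frames, joints).
    view_a, view_b: camera-view identifiers of the two samples.
    """
    if view_a != view_b:
        # Cross-view mixing is skipped to avoid distribution shift.
        return x_a
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * x_a + (1.0 - lam) * x_b
```

In practice the mixed sample would be paired with a correspondingly mixed label; the sketch above shows only the input-level operation.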
📝 Abstract
Skeleton-based action recognition faces two longstanding challenges: the scarcity of labeled training samples and the difficulty of modeling short- and long-range temporal dependencies. To address these issues, we propose a unified framework, LSTC-MDA, which simultaneously improves temporal modeling and data diversity. We introduce a novel Long-Short Term Temporal Convolution (LSTC) module with parallel short- and long-term branches; the two feature streams are aligned and fused adaptively using learned similarity weights, preserving critical long-range cues that conventional stride-2 temporal convolutions discard. We also extend Joint Mixing Data Augmentation (JMDA) with Additive Mixup at the input level, diversifying training samples while restricting mixup operations to the same camera view to avoid distribution shifts. Ablation studies confirm that each component contributes. LSTC-MDA achieves state-of-the-art results: 94.1% and 97.5% on NTU 60 (X-Sub and X-View), 90.4% and 92.0% on NTU 120 (X-Sub and X-Set), and 97.2% on NW-UCLA. Code: https://github.com/xiaobaoxia/LSTC-MDA.
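The similarity-weighted alignment of the two temporal branches might look like the sketch below. This is one plausible reading under stated assumptions: the paper learns its weights end-to-end, whereas here the weights are derived from cosine similarity to the branch average purely for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def similarity_weighted_fusion(f_short, f_long, eps=1e-8):
    """Fuse short- and long-term feature maps of shape (C, T) (sketch).

    Each branch gets a scalar weight from its cosine similarity to the
    branch average; the weights are normalized with a softmax and used
    in a convex combination of the two branches.
    """
    ref = 0.5 * (f_short + f_long)

    def cos(a, b):
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

    w = softmax(np.array([cos(f_short, ref), cos(f_long, ref)]))
    return w[0] * f_short + w[1] * f_long
```

The design intent is that neither branch is discarded outright: a branch whose features agree less with the consensus is down-weighted rather than dropped, so long-range cues survive the fusion.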