Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition

📅 2024-10-28

🏛️ ACM Multimedia

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the accuracy-efficiency trade-off in skeleton-based sign language recognition, which stems from unrealistic hand poses, poor robustness to missing data, and varying gloss complexity, this paper proposes: (1) a kinematic hand pose rectification method that enforces constraints to make hand skeletons more realistic; (2) a feature-isolated mechanism that captures local spatial-temporal context from each feature independently; and (3) an input-adaptive inference strategy that adjusts computation to the complexity of each sign gloss. The model combines kinematic hand modeling, a feature-isolated Transformer, and adaptive inference. It sets a new state of the art on WLASL100 and LSA64, with top-1 accuracies of 86.50% (a 2.39% relative improvement over the previous SOTA) and 99.84%, respectively.

📝 Abstract

Sign language recognition (SLR) refers to automatically interpreting sign language glosses from given videos. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, as most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both the training and inference phases, and capture intricate relationships among different body parts collectively; 3) they treat all sign glosses uniformly, failing to account for differences in the complexity of their skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism that focuses on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.
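
The input-adaptive inference idea in the abstract can be illustrated with a confidence-based early exit: easy inputs leave the network after a few layers, hard inputs use all of them. What follows is only a minimal sketch under stated assumptions, not the paper's implementation. The function `adaptive_infer`, the toy layers and exit heads, and the 0.9 threshold are all hypothetical, and the abstract does not say which exit criterion the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_infer(features, layers, exit_heads, threshold=0.9):
    """Run layers in order and stop at the first confident exit head.

    layers     -- callables mapping features to features (hypothetical)
    exit_heads -- callables mapping features to class logits, one per layer
    Returns (predicted_class, layers_used).
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer, and at least one layer")
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        features = layer(features)
        probs = softmax(head(features))
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            break  # confident enough: skip the remaining layers
    return best, depth


# Toy stand-ins: each "layer" doubles the features, each head reads them as logits.
def double(features):
    return [2 * x for x in features]


def identity_head(features):
    return features


layers = [double, double, double]
heads = [identity_head, identity_head, identity_head]
print(adaptive_infer([1.0, 0.0], layers, heads))  # easy input, exits after 2 layers
print(adaptive_infer([0.1, 0.0], layers, heads))  # hard input, uses all 3 layers
```

In this toy setup the threshold is the single knob that trades accuracy for computation: lowering it lets more inputs exit early.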

Problem

Research questions and friction points this paper is trying to address.

Enhancing realism in hand skeletal representations for SLR.

Mitigating the impact of missing data with a feature-isolated mechanism.

Adapting to varying complexity levels in sign glosses.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kinematic hand pose rectification for realistic constraints

Feature-isolated mechanism for local spatial-temporal context

Input-adaptive inference for varying complexity levels
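
As a rough illustration of what kinematic rectification could look like, the sketch below clamps the bending angles of one planar finger to a plausible range and rebuilds the joint positions with forward kinematics, keeping bone lengths unchanged. Everything here is an assumption for illustration: the 2D simplification, the sign convention (counter-clockwise bending is positive), the `FLEXION_LIMITS` values, and the function names are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical (min, max) bending limits in radians for the three joints of one
# finger; a real system would take these from an anatomical hand model.
FLEXION_LIMITS = [(-0.35, 1.57), (0.0, 1.92), (0.0, 1.40)]


def bone_angles(points):
    """Bending angle of each bone relative to the previous one (first vs. +x axis)."""
    angles = []
    prev_dir = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        direction = math.atan2(y1 - y0, x1 - x0)
        rel = direction - prev_dir
        rel = math.atan2(math.sin(rel), math.cos(rel))  # wrap into [-pi, pi]
        angles.append(rel)
        prev_dir = direction
    return angles


def rectify_finger(points, limits=FLEXION_LIMITS):
    """Clamp joint angles to the limits and rebuild the chain from its base point."""
    if len(points) - 1 != len(limits):
        raise ValueError("need exactly one (min, max) limit per bone")
    lengths = [math.dist(a, b) for a, b in zip(points, points[1:])]
    clamped = [min(max(a, lo), hi)
               for a, (lo, hi) in zip(bone_angles(points), limits)]
    rebuilt = [points[0]]
    heading = 0.0
    for length, angle in zip(lengths, clamped):
        heading += angle  # accumulate relative angles into an absolute heading
        x, y = rebuilt[-1]
        rebuilt.append((x + length * math.cos(heading),
                        y + length * math.sin(heading)))
    return rebuilt


# A finger whose middle joint bends 90 degrees the wrong way gets straightened.
distorted = [(0.0, 0.0), (1.0, 0.0), (1.0, -1.0), (1.0, -2.0)]
print(rectify_finger(distorted))
```

Working in angle space rather than directly on coordinates means a correction at one joint carries through to every joint further along the finger, which is why the chain is rebuilt from the base outward.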

🔎 Similar Papers

SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale