SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

This work addresses the limitations of existing sign language recognition methods that rely on generic action pre-trained encoders, which struggle to capture subtle hand articulation differences. To overcome this, the authors propose a segmentation-driven self-supervised pre-training framework that treats hand poses as dynamic visual units by selectively masking the spatiotemporal presence and motion of key body parts—particularly the hands—in a data-driven manner. This approach enables fine-grained representation learning tailored to sign language. Evaluated on the WLASL, NMFs-CSL, and Slovo datasets, the method achieves state-of-the-art performance, significantly improving both instance-level and category-level Top-1 accuracy while using fewer video frames and modalities compared to prior approaches.

📝 Abstract

Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.

Problem

Research questions and friction points this paper is trying to address.

sign language recognition

fine-grained cues

hand differences

self-supervised learning

segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning

segmentation-driven masking

sign language recognition