SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

📅 2026-05-03

🤖 AI Summary
This work addresses a limitation of existing sign language recognition methods: they rely on encoders pre-trained on generic action datasets, which capture subtle differences in hand articulation poorly. The authors propose a self-supervised pre-training framework whose masking is driven by segmentation. Rather than treating hand poses as static visual tokens, the mask adapts to the presence and motion of key body parts, particularly the hands. The resulting mask-and-reconstruct objective yields fine-grained representations tailored to sign language. On the WLASL, NMFs-CSL, and Slovo datasets, the method achieves state-of-the-art performance, improving both per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
📝 Abstract
Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
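One way to picture segmentation-based masking is sketched below. This is only an illustration: the abstract does not state the sampling rule, so the function name `segmentation_guided_mask`, the per-token hand-coverage input, the `hand_bias` parameter, and the weighted sampling scheme are all assumptions, not the authors' method. The sketch picks a fixed fraction of tokens to hide and makes tokens that overlap a hand segmentation more likely to be chosen.

```python
import random


def segmentation_guided_mask(coverage, mask_ratio=0.75, hand_bias=4.0, seed=None):
    """Return a boolean mask (True = token is hidden from the encoder).

    coverage  : list of floats in [0, 1], one per spatiotemporal token, giving
                the fraction of the token covered by a hand segmentation
                (hypothetical input; any segmenter could produce it).
    mask_ratio: fraction of tokens to mask.
    hand_bias : how strongly hand coverage raises the masking weight
                (illustrative assumption; 0 gives uniform random masking).
    """
    rng = random.Random(seed)
    n = len(coverage)
    k = round(mask_ratio * n)

    # Every token keeps a base weight of 1 so background can still be masked.
    weights = [1.0 + hand_bias * c for c in coverage]

    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u ~ U(0, 1), use key u ** (1 / w), and keep the k largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    chosen = sorted(range(n), key=lambda i: keys[i], reverse=True)[:k]

    mask = [False] * n
    for i in chosen:
        mask[i] = True
    return mask


# Usage: 8 tokens, the first two mostly covered by a hand.
coverage = [0.9, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
mask = segmentation_guided_mask(coverage, mask_ratio=0.5, seed=0)
print(mask)
```

In a mask-and-reconstruct setup, the tokens marked True would be withheld from the encoder and used as reconstruction targets. Setting `hand_bias=0` reduces the sketch to uniform random masking, which makes the effect of the segmentation signal easy to compare.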
Problem

Research questions and friction points this paper is trying to address.

sign language recognition
fine-grained cues
hand differences
self-supervised learning
segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
segmentation-driven masking
sign language recognition
mask-and-reconstruct
fine-grained representation