🤖 AI Summary
This work addresses a limitation of sign language recognition methods that rely on encoders pretrained on generic action datasets: such encoders capture subtle hand differences poorly. The authors propose a segmentation-driven self-supervised pretraining framework whose masking adapts to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained representation learning for sign language. On the WLASL, NMFs-CSL, and Slovo datasets, the encoder achieves state-of-the-art performance, improving both per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
📝 Abstract
Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
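To make the idea of segmentation-based masking concrete, here is a minimal sketch of how a mask could be biased toward regions where key body parts are present or moving. It is an illustration under stated assumptions, not the authors' implementation: the function name `segmentation_guided_mask`, the patch size, the mask ratio, and the scoring rule (patch coverage plus frame-to-frame change) are all hypothetical choices, and the binary segmentation maps are assumed to come from some external segmenter.

```python
import numpy as np


def segmentation_guided_mask(seg, patch=4, mask_ratio=0.5, rng=None):
    """Sample a patch-level mask biased toward segmented, moving regions.

    seg: (T, H, W) binary array, 1 where a key body part (e.g. a hand)
         is present. Assumed to be produced by an external segmenter.
    Returns a boolean array of shape (T, H // patch, W // patch),
    where True marks a patch that would be hidden from the encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    seg = np.asarray(seg, dtype=np.float64)
    T, H, W = seg.shape
    gh, gw = H // patch, W // patch

    # Presence: fraction of each patch covered by the segmentation.
    cropped = seg[:, : gh * patch, : gw * patch]
    presence = cropped.reshape(T, gh, patch, gw, patch).mean(axis=(2, 4))

    # Motion: absolute change in coverage between consecutive frames.
    motion = np.zeros_like(presence)
    motion[1:] = np.abs(presence[1:] - presence[:-1])

    # Hypothetical scoring rule; the small floor keeps background
    # patches selectable so sampling without replacement always works.
    score = presence + motion + 1e-3
    probs = (score / score.sum()).ravel()

    n_mask = int(round(mask_ratio * probs.size))
    idx = rng.choice(probs.size, size=n_mask, replace=False, p=probs)

    mask = np.zeros(probs.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(T, gh, gw)


# Toy clip: 4 frames of 8x8 pixels with a "hand" in the top-left corner.
seg = np.zeros((4, 8, 8))
seg[:, :4, :4] = 1
mask = segmentation_guided_mask(seg, patch=4, mask_ratio=0.5,
                                rng=np.random.default_rng(0))
```

In a mask-and-reconstruct setup, the encoder would see only the unmasked patches and the reconstruction loss would typically be computed on the masked ones; how the paper weights presence against motion is not stated in the abstract, so the additive score above is only a placeholder.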