SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

📅 2024-06-11
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Sign language video-to-text translation is hampered by representation learning that is sensitive to irrelevant visual distractors, and by overreliance on unstable keypoint detection. Method: This paper proposes a pose-estimation-free, multi-stream, self-supervised representation learning framework. It employs three lightweight visual encoders, dedicated to the facial, hand, and body regions, coupled with frame-level contrastive learning to achieve semantics-aware, region-specific modeling. Crucially, it eliminates explicit keypoint detection and instead performs end-to-end self-supervised pretraining to jointly optimize linguistic fidelity and computational efficiency. Results: On the How2Sign benchmark, the method matches state-of-the-art translation performance while using less than 3% of the compute of the prior state of the art, a roughly 97% reduction in training cost. This efficiency gain substantially improves the practicality and deployability of large-scale sign language translation systems.

📝 Abstract
A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.
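The multi-stream, frame-level idea described in the abstract can be sketched in a few lines: each frame is split into face, hand, and body crops, each crop passes through its own lightweight encoder, and the concatenated frame embedding is trained with a contrastive (InfoNCE-style) loss between two augmented views. The crop layout, linear "encoders", and temperature below are illustrative placeholders, not the paper's actual architecture.

```python
# Minimal NumPy sketch of multi-stream, frame-level contrastive learning.
# All sizes and weights are hypothetical; real encoders would be small CNNs.
import numpy as np

rng = np.random.default_rng(0)

def encode_stream(crop, W):
    """Flatten a region crop and project it with a linear 'encoder'."""
    z = crop.reshape(-1) @ W
    return z / np.linalg.norm(z)  # L2-normalise the stream embedding

def embed_frame(frame, weights):
    # Hypothetical fixed crops standing in for detected face/hand/body regions.
    regions = {
        "face": frame[0:8, 0:8],
        "hands": frame[8:16, 0:8],
        "body": frame[0:16, 8:16],
    }
    parts = [encode_stream(regions[k], weights[k]) for k in regions]
    z = np.concatenate(parts)
    return z / np.linalg.norm(z)

def info_nce(Z1, Z2, tau=0.1):
    """Frame-level contrastive loss: matching frames across views are positives."""
    logits = (Z1 @ Z2.T) / tau  # cosine similarities between view-1 and view-2 frames
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy batch: 4 individual frames (no video sequences), two augmented views each.
frames = rng.normal(size=(4, 16, 16))
views1 = frames + 0.05 * rng.normal(size=frames.shape)
views2 = frames + 0.05 * rng.normal(size=frames.shape)

dims = {"face": 64, "hands": 64, "body": 128}  # flattened crop sizes
weights = {k: rng.normal(size=(d, 16)) for k, d in dims.items()}

Z1 = np.stack([embed_frame(f, weights) for f in views1])
Z2 = np.stack([embed_frame(f, weights) for f in views2])
loss = info_nce(Z1, Z2)
```

Because the loss is computed per frame rather than over video clips, a training step touches far less data than sequence-level pretraining, which is the source of the efficiency claim.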
Problem

Research questions and friction points this paper is trying to address.

Efficient representation learning for sign language translation
Focusing on relevant signer parts: face, hands, and body pose
Self-supervised learning of handshapes and facial expressions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focuses on face, hands, and body pose
Self-supervised learning for handshapes and expressions
Efficient frame-based learning approach
Shester Gueuwou
Toyota Technological Institute at Chicago
Xiaodan Du
Toyota Technological Institute at Chicago
Gregory Shakhnarovich
Toyota Technological Institute at Chicago
Karen Livescu
Toyota Technological Institute at Chicago
speech and language processing · machine learning