Sign Language Translation with Sentence Embedding Supervision

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sign language translation (SLT) systems rely heavily on gloss annotations, which are scarce and inconsistent across datasets, hindering end-to-end modeling and generalization. This work proposes a gloss-free end-to-end SLT framework in which sentence embeddings of the target sentences, rather than word-level glosses, supply the supervision signal at training time. Because the embeddings are learned from raw textual data, the method requires only video-text pairs, with no gloss annotations and no auxiliary SLT datasets for pretraining. Evaluated on PHOENIX-2014T (German Sign Language) and How2Sign (American Sign Language) with both mono- and multilingual sentence embeddings, the approach significantly outperforms other gloss-free systems, setting a new state of the art for datasets without gloss annotations and narrowing the gap to gloss-dependent systems. Since the supervision is annotation-free and the embedding models are multilingual, the framework extends naturally to multilingual SLT.

📝 Abstract
State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end-to-end manner or by involving an intermediate step. Unfortunately, gloss-labelled sign language data is usually not available at scale and, when available, gloss annotations differ widely from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but is learned from raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state of the art for datasets where glosses are not available and where no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
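
The abstract does not spell out the training objective, but the core idea lends itself to a compact illustration. Below is a minimal sketch, assuming LaBSE as the multilingual sentence embedder, a generic mean-pooling video encoder, and a cosine loss; the embedder choice, encoder architecture, and loss function are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of sentence-embedding supervision for gloss-free SLT.
# Assumptions (not from the paper): LaBSE as the multilingual embedder,
# a mean-pooling video encoder, and a cosine loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

class VideoEncoder(nn.Module):
    """Placeholder video encoder: projects frame features and mean-pools over time."""
    def __init__(self, frame_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) -> (batch, embed_dim)
        return self.proj(frames).mean(dim=1)

# A frozen text embedder turns each target sentence into a supervision target;
# only raw video-text pairs are needed, no gloss annotations.
embedder = SentenceTransformer("sentence-transformers/LaBSE")
target_sentences = ["am montag regnet es im norden"]  # target text of a video-text pair
with torch.no_grad():
    targets = embedder.encode(target_sentences, convert_to_tensor=True)  # (batch, 768)

encoder = VideoEncoder(frame_dim=1024, embed_dim=targets.shape[1])
frames = torch.randn(1, 64, 1024)  # dummy clip: 64 frames of 1024-d features
pred = encoder(frames)

# Pull the video representation toward the sentence embedding of its translation.
loss = 1.0 - F.cosine_similarity(pred, targets).mean()
loss.backward()
```

The design point is that a frozen text embedder converts every target sentence into a dense training target for free, so no frame-level or gloss-level alignment is ever required.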
Problem

Research questions and friction points this paper is trying to address.

Mitigating the scarcity and inconsistency of gloss annotations in sign language translation
Replacing manual gloss annotation with sentence embedding supervision learned from raw text
Supporting multilingual sign language translation within a single framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sentence embeddings of the target sentences in place of gloss supervision
Learns the supervision signal from raw text, with no manual annotation
Enables multilingual SLT through a shared embedding space (see the sketch after this list)
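
A brief, self-contained illustration of why this supervision is naturally multilingual, again assuming LaBSE as the embedder: translations of the same sentence map to nearby points, so one embedding space can in principle supervise both the German and American datasets.

```python
# Hypothetical illustration (LaBSE is an assumption, not the paper's stated model):
# semantically equivalent sentences in different languages land close together.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/LaBSE")
de = embedder.encode(["am montag regnet es"], convert_to_tensor=True)
en = embedder.encode(["it rains on monday"], convert_to_tensor=True)
# High similarity means a single embedding space can serve as the supervision
# target for both PHOENIX-2014T (German) and How2Sign (English) translations.
print(F.cosine_similarity(de, en).item())
```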
🔎 Similar Papers
No similar papers found.