Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first unified sign language understanding model that jointly addresses sign language translation (SLT) and sign-subtitle alignment (SSA), enabling end-to-end conversion of continuous signing videos into spoken-language text with precise temporal localization. Methodologically, it employs a lightweight multimodal input representation (pose keypoints plus lip-region images), a sliding-window Perceiver mapping network, and a scalable multi-task training framework, making it the first approach to jointly optimize SLT and SSA while supporting cross-lingual zero-shot transfer. Pretrained on multilingual BSL and ASL data, the model achieves state-of-the-art performance on both SLT and SSA on BOBSL, and substantially outperforms prior methods on How2Sign, demonstrating strong generalization and effective fine-tuning. This unified, efficient, and extensible framework supports sign language education, accessibility, and large-scale corpus construction.
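To make the input representation concrete, below is a minimal PyTorch sketch of a privacy-preserving multimodal frame encoder in the spirit of the one described above: pose keypoints and lip-region crops are encoded separately and fused into one feature per frame. The module name `MultimodalFrameEncoder`, all dimensions, and all layer choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalFrameEncoder(nn.Module):
    """Hypothetical sketch: fuse per-frame pose keypoints and a lip crop."""

    def __init__(self, num_keypoints=133, kp_dim=2, lip_channels=3, d_model=256):
        super().__init__()
        # MLP over flattened (x, y) keypoints -- no full RGB frame is used,
        # which is what preserves signer privacy.
        self.pose_encoder = nn.Sequential(
            nn.Linear(num_keypoints * kp_dim, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Small CNN over the lip crop to capture non-manual (mouthing) cues.
        self.lip_encoder = nn.Sequential(
            nn.Conv2d(lip_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, d_model),
        )
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, keypoints, lip_crops):
        # keypoints: (B, T, num_keypoints, 2); lip_crops: (B, T, 3, H, W)
        B, T = keypoints.shape[:2]
        pose_feat = self.pose_encoder(keypoints.flatten(2))         # (B, T, d)
        lip_feat = self.lip_encoder(lip_crops.flatten(0, 1))        # (B*T, d)
        lip_feat = lip_feat.view(B, T, -1)                          # (B, T, d)
        return self.fuse(torch.cat([pose_feat, lip_feat], dim=-1))  # (B, T, d)
```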

📝 Abstract
Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
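The abstract describes the Sliding Perceiver only at a high level, so the following is a hedged sketch of one plausible reading: a small set of learned latent queries cross-attends to each sliding window of consecutive frame features, yielding roughly word-level embeddings for the text side. The window size, stride, latent count, head count, and the name `SlidingPerceiverMapper` are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SlidingPerceiverMapper(nn.Module):
    """Hypothetical sketch: Perceiver-style cross-attention per sliding window."""

    def __init__(self, d_model=256, num_latents=1, window=16, stride=8, num_heads=4):
        super().__init__()
        self.window, self.stride = window, stride
        # Learned latent queries shared across all windows.
        self.latents = nn.Parameter(torch.randn(num_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, d) -> word-level embeddings (B, n_windows * L, d)
        B, T, _ = frame_feats.shape
        outputs = []
        for start in range(0, max(T - self.window, 0) + 1, self.stride):
            window_feats = frame_feats[:, start:start + self.window]  # (B, W, d)
            q = self.latents.unsqueeze(0).expand(B, -1, -1)           # (B, L, d)
            attended, _ = self.cross_attn(q, window_feats, window_feats)
            outputs.append(attended + self.ffn(attended))             # (B, L, d)
        return torch.cat(outputs, dim=1)
```

Under this reading, the concatenated window embeddings would be what bridges the vision-text gap: they arrive at roughly the granularity a language-model decoder expects, rather than one embedding per video frame.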
Problem

Research questions and friction points this paper is trying to address.

Develop a unified model for sign language translation and alignment.
Convert signing videos to text and align signs with subtitles.
Enable cross-linguistic generalization across different sign languages.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight visual backbone using keypoints and lip images
Sliding Perceiver network mapping visual features to word embeddings
Multi-task training jointly optimizing translation and alignment (see the loss sketch after this list)
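As a rough illustration of how the joint objective might be composed, here is a hypothetical combination of a token-level translation cross-entropy term and a per-frame alignment term. The SSA loss form (classifying each frame to a subtitle index) and the weight `lam` are assumptions, not the paper's specification.

```python
import torch.nn.functional as F

def multitask_loss(slt_logits, target_tokens, ssa_logits, target_segments,
                   lam=0.5, pad_id=0):
    # SLT term: slt_logits (B, S, vocab) from the decoder, targets (B, S).
    slt = F.cross_entropy(
        slt_logits.transpose(1, 2), target_tokens, ignore_index=pad_id
    )
    # SSA term (assumed form): ssa_logits (B, T, num_subtitles) scores each
    # frame against candidate subtitles; target_segments (B, T) holds the
    # ground-truth subtitle index per frame.
    ssa = F.cross_entropy(ssa_logits.transpose(1, 2), target_segments)
    return slt + lam * ssa
```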