🤖 AI Summary
This work introduces the first unified sign language understanding model that jointly addresses sign language translation (SLT) and sign-subtitle alignment (SSA), enabling end-to-end conversion of continuous signing videos into spoken-language text with precise temporal localization. Methodologically, it combines a lightweight multimodal input representation (pose keypoints + lip images), a sliding-window Perceiver mapping network, and a scalable multi-task training framework, making it the first approach to jointly optimize SLT and SSA while supporting cross-lingual zero-shot transfer. Pretrained on multilingual BSL and ASL data from BOBSL and YouTube-SL-25, the model achieves state-of-the-art performance on both SLT and SSA on BOBSL and demonstrates strong zero-shot generalization and fine-tuning efficacy on How2Sign (ASL). This unified, efficient, and extensible framework supports sign language education, accessibility, and large-scale corpus construction.
📝 Abstract
Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken-language text as well as the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a scalable multi-task training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL), drawn from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and strong finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
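The Sliding Perceiver in component (ii) aggregates consecutive frame-level visual features into compact word-level embeddings. A minimal sketch of that sliding-window attention pooling, in plain Python: the window and stride sizes, the single fixed latent query, and all function names here are illustrative assumptions, not the paper's actual architecture (which uses learned latents and full cross-attention layers).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(window, latent):
    """Pool a window of frame features into one vector via a
    single latent query attending over the frames (toy cross-attention)."""
    scores = softmax([sum(q * f for q, f in zip(latent, frame))
                      for frame in window])
    dim = len(latent)
    # Weighted sum of frames: a convex combination per dimension.
    return [sum(w * frame[d] for w, frame in zip(scores, window))
            for d in range(dim)]

def sliding_perceiver(features, window=8, stride=4, latent=None):
    """Slide a window over the frame sequence and pool each window
    into one 'word-level' embedding (window/stride are hypothetical)."""
    dim = len(features[0])
    latent = latent or [1.0] * dim  # stand-in for a learned latent query
    pooled = []
    for start in range(0, max(len(features) - window + 1, 1), stride):
        pooled.append(attention_pool(features[start:start + window], latent))
    return pooled

# Example: 16 frames of 4-d features -> 3 word-level embeddings.
feats = [[float(t + d) for d in range(4)] for t in range(16)]
words = sliding_perceiver(feats, window=8, stride=4)
print(len(words), len(words[0]))
```

The sliding window keeps each pooled embedding local in time, which is what lets the same representation serve both translation (SLT) and temporal alignment (SSA).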