Deep Understanding of Sign Language for Sign to Subtitle Alignment

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of scarce annotated data and low temporal alignment accuracy in sign language video–caption alignment. To tackle these issues, we propose a temporal alignment framework integrating linguistic priors with self-supervised learning. Methodologically: (1) we preprocess subtitles using British Sign Language (BSL) grammar rules to enhance linguistic structural coherence; (2) we introduce a selective alignment loss that applies supervision only during intervals where the queried sign is actually produced; and (3) we replace noisy audio-aligned heuristic labels with high-confidence pseudo-labels generated through self-training. Experiments demonstrate substantial improvements over prior methods in frame-level accuracy and F1 score, establishing new state-of-the-art performance on sign language video–text temporal alignment. Our approach provides a scalable, low-resource paradigm for sign language understanding, particularly beneficial where manual annotation is prohibitively expensive or unavailable.

📝 Abstract
The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data. To achieve this goal, we propose a novel framework with the following contributions: (1) we leverage fundamental grammatical rules of British Sign Language (BSL) to pre-process the input subtitles, (2) we design a selective alignment loss to optimise the model for predicting the temporal location of signs only when the queried sign actually occurs in a scene, and (3) we conduct self-training with refined pseudo-labels which are more accurate than the heuristic audio-aligned labels. From this, our model not only better understands the correlation between the text and the signs, but also holds potential for application in the translation of sign languages, particularly in scenarios where manual labelling of large-scale sign data is impractical or challenging. Extensive experimental results demonstrate that our approach achieves state-of-the-art results, surpassing previous baselines by substantial margins in terms of both frame-level accuracy and F1-score. This highlights the effectiveness and practicality of our framework in advancing the field of sign language video alignment and translation.
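The self-training step described in the abstract can be sketched as a simple confidence filter over model-predicted sign intervals. The function name, tuple format, and the 0.7 threshold below are illustrative assumptions, not the paper's actual implementation:

```python
def select_pseudo_labels(predictions, conf_thresh=0.7):
    """Keep only high-confidence predicted sign intervals as
    self-training targets, in place of the noisier heuristic
    audio-aligned labels.

    predictions: list of (start_frame, end_frame, confidence) tuples.
    conf_thresh: hypothetical confidence cutoff for accepting a label.
    """
    return [(start, end) for start, end, conf in predictions
            if conf >= conf_thresh]


# Example: two confident intervals survive, one uncertain one is dropped.
preds = [(12, 30, 0.92), (45, 60, 0.40), (80, 95, 0.75)]
print(select_pseudo_labels(preds))  # [(12, 30), (80, 95)]
```

In practice the retained intervals would be fed back as training targets for the next round of self-training, while low-confidence predictions are discarded rather than trusted.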
Problem

Research questions and friction points this paper is trying to address.

Align asynchronous subtitles in sign language videos
Improve temporal location prediction of signs
Enhance sign language translation with limited labelled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes BSL grammatical rules for subtitle pre-processing
Implements selective alignment loss for temporal sign prediction
Employs self-training with refined pseudo-labels for accuracy
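The selective alignment loss listed above can be sketched as frame-level binary cross-entropy that is masked out whenever the queried sign does not occur in the clip. The array shapes, masking scheme, and normalisation below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def selective_alignment_loss(pred_probs, frame_labels, query_present):
    """Frame-level BCE, supervised only for queries whose sign
    actually occurs in the scene.

    pred_probs:    (num_queries, num_frames) predicted probabilities
    frame_labels:  (num_queries, num_frames) binary ground truth
    query_present: (num_queries,) 1.0 if the queried sign occurs, else 0.0
    """
    eps = 1e-8
    p = np.clip(pred_probs, eps, 1.0 - eps)
    bce = -(frame_labels * np.log(p) + (1 - frame_labels) * np.log(1 - p))
    mask = query_present[:, None]  # broadcast the per-query mask over frames
    # Normalise only over the frames that actually receive supervision.
    supervised_frames = max(mask.sum() * pred_probs.shape[1], 1)
    return float((bce * mask).sum() / supervised_frames)


# Example: the second query's sign is absent, so it contributes no loss.
probs   = np.array([[0.9, 0.1], [0.2, 0.8]])
labels  = np.array([[1.0, 0.0], [0.0, 1.0]])
present = np.array([1.0, 0.0])
print(selective_alignment_loss(probs, labels, present))  # ≈ 0.1054
```

Masking absent queries keeps the model from being penalised for (correctly) predicting no temporal location when a queried sign never appears, which is the intent behind the selective supervision.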