Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

📅 2025-01-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the end-to-end translation of continuous sign language video into spoken-language text. The proposed LLM-driven method integrates heterogeneous contextual cues from multiple sources: automatically extracted program subtitles describing the background show (cross-modal context), translations of preceding sentences (temporal context), and pseudo-glosses transcribing the signing (lexical context), all grounded in visual sign recognition features. These cues are jointly modeled via multimodal encoding and a fine-tuned large language model, unifying cross-modal, temporal, and gloss-level signals to improve long-range dependency modeling and ambiguity resolution. The method achieves substantial gains over previously reported results on BOBSL and competitive results on How2Sign. Ablation studies confirm the positive contribution of each contextual cue.

๐Ÿ“ Abstract
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translations of previous sentences, and (iii) pseudo-glosses transcribing the signing. These are automatically extracted and input, along with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
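
A minimal sketch (not the authors' implementation) of how the inputs described in the abstract might be assembled for a decoder-only LLM: visual sign features are projected into the LLM embedding space while the three textual cues are serialised into a conditioning string. The class `VisualProjector`, the helper `build_prompt`, the cue formatting, and all dimensions are illustrative assumptions; the fine-tuned LLM forward pass itself is omitted.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Maps per-frame sign recognition features into the LLM embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (num_frames, feat_dim) -> (num_frames, llm_dim)
        return self.proj(video_feats)


def build_prompt(background_caption: str, previous_translation: str,
                 pseudo_glosses: list[str]) -> str:
    """Serialise the three textual cues into a single conditioning string."""
    return (
        f"Background: {background_caption}\n"
        f"Previous: {previous_translation}\n"
        f"Glosses: {' '.join(pseudo_glosses)}\n"
        "Translation:"
    )


# Toy usage: project visual features and build the textual context; a real
# system would embed the prompt, concatenate it with the visual tokens, and
# decode with the fine-tuned LLM (omitted in this sketch).
projector = VisualProjector(feat_dim=768, llm_dim=4096)
video_feats = torch.randn(250, 768)      # hypothetical clip of sign video features
visual_tokens = projector(video_feats)   # (250, 4096)

prompt = build_prompt(
    background_caption="A chef plates a dessert in the studio kitchen.",
    previous_translation="She explains how the sauce is reduced.",
    pseudo_glosses=["NOW", "ADD", "SUGAR", "SLOWLY"],
)
print(visual_tokens.shape)
print(prompt)
```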
Problem

Research questions and friction points this paper is trying to address.

Sign Language Translation
Textual Accuracy
Contextual Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sign Language Translation
Multimodal Information
Multilingual Model
Youngjoon Jang
KAIST
Computer Vision, Machine Learning
Haran Raajesh
CVIT, IIIT Hyderabad, India; LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Liliane Momeni
University of Oxford
Computer Vision, Machine Learning, Artificial Intelligence
Gül Varol
LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Andrew Zisserman
University of Oxford
Computer Vision, Machine Learning