Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

📅 2025-01-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the end-to-end translation of continuous sign language video into spoken-language text. The proposed LLM-driven method integrates heterogeneous contextual cues from multiple sources: automatically extracted program subtitles describing the background show (cross-modal context), translations of preceding sentences (temporal context), and pseudo-glosses transcribing the signing (lexical context), all grounded in visual sign recognition features. These cues are jointly modeled via multimodal encoding and a fine-tuned large language model, unifying cross-modal, temporal, and gloss-level signals to improve long-range dependency modeling and ambiguity resolution. The method achieves substantial gains over previously reported results on BOBSL and competitive results on How2Sign. Ablation studies confirm the positive contribution of each contextual cue.

๐Ÿ“ Abstract
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translations of previous sentences, and (iii) pseudo-glosses transcribing the signing. These are automatically extracted and input, along with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
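
A minimal sketch (not the authors' implementation) of how the inputs described in the abstract might be assembled for a decoder-only LLM: visual sign features are projected into the LLM embedding space while the three textual cues are serialised into a conditioning string. The class `VisualProjector`, the helper `build_prompt`, the cue formatting, and all dimensions are illustrative assumptions; the fine-tuned LLM forward pass itself is omitted.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Maps per-frame sign recognition features into the LLM embedding space."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (num_frames, feat_dim) -> (num_frames, llm_dim)
        return self.proj(video_feats)


def build_prompt(background_caption: str, previous_translation: str,
                 pseudo_glosses: list[str]) -> str:
    """Serialise the three textual cues into a single conditioning string."""
    return (
        f"Background: {background_caption}\n"
        f"Previous: {previous_translation}\n"
        f"Glosses: {' '.join(pseudo_glosses)}\n"
        "Translation:"
    )


# Toy usage: project visual features and build the textual context; a real
# system would embed the prompt, concatenate it with the visual tokens, and
# decode with the fine-tuned LLM (omitted in this sketch).
projector = VisualProjector(feat_dim=768, llm_dim=4096)
video_feats = torch.randn(250, 768)      # hypothetical clip of sign video features
visual_tokens = projector(video_feats)   # (250, 4096)

prompt = build_prompt(
    background_caption="A chef plates a dessert in the studio kitchen.",
    previous_translation="She explains how the sauce is reduced.",
    pseudo_glosses=["NOW", "ADD", "SUGAR", "SLOWLY"],
)
print(visual_tokens.shape)
print(prompt)
```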
Problem

Research questions and friction points this paper is trying to address.

Sign Language Translation
Textual Accuracy
Contextual Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sign Language Translation
Multimodal Information
Multilingual Model
Youngjoon Jang
KAIST
Computer Vision, Machine Learning
Haran Raajesh
CVIT, IIIT Hyderabad, India; LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Liliane Momeni
University of Oxford
Computer Vision, Machine Learning, Artificial Intelligence
Gül Varol
LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Andrew Zisserman
University of Oxford
Computer Vision, Machine Learning