Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sign Language Translation (SLT) faces a "pseudo-label dependency bottleneck": state-of-the-art approaches rely on manually annotated glosses as intermediate representations, incurring high annotation costs and suffering from data scarcity. To address this, we propose the first LLM-driven pseudo-gloss generation framework that eliminates gloss supervision entirely. Our method first employs a large language model with in-context learning to generate initial pseudo-morphemes; then refines video-morpheme alignment via weakly supervised sequence re-ranking; and finally adopts a three-stage cross-modal joint training strategy that co-optimizes the visual encoder and translator with an integrated CTC loss and progressive modality alignment. Crucially, our approach retains structured intermediate representations while removing reliance on expert annotations. On two standard SLT benchmarks, it matches the performance of state-of-the-art gloss-supervised methods and significantly enhances practicality in few-shot and low-resource settings.

📝 Abstract
Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives, and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline that progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
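The CTC supervision described in the abstract can be sketched with PyTorch's built-in `nn.CTCLoss`. The tensor shapes, vocabulary size, and pseudo-gloss token ids below are illustrative assumptions, not the paper's actual configuration; the frame-level predictions would come from the visual encoder in the real pipeline.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 100 video frames, batch of 2, 500 pseudo-gloss
# tokens plus one CTC blank symbol (id 0).
T, N, V = 100, 2, 501

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Stand-in for frame-level log-probabilities from the visual encoder,
# shaped (time, batch, vocab) as nn.CTCLoss expects.
log_probs = torch.randn(T, N, V).log_softmax(dim=2)

# Reordered pseudo-gloss targets (token ids > 0), padded to the longest
# sequence in the batch; true lengths are given separately.
targets = torch.tensor([[5, 12, 7, 0], [3, 9, 0, 0]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3, 2])

# CTC marginalizes over all monotonic alignments of glosses to frames,
# so no frame-level gloss annotation is needed.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```

Because CTC only assumes a monotonic alignment, the weakly supervised reordering step matters: it puts the pseudo glosses into an order consistent with the signing sequence before this loss is applied.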
Problem

Research questions and friction points this paper is trying to address.

Eliminates need for costly expert-annotated gloss labels
Generates pseudo glosses via LLM and aligns with videos
Improves sign language translation without gloss dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates pseudo glosses without human annotations
Weakly supervised learning aligns pseudo glosses with videos
Three-stage pipeline narrows sign-spoken language gap
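The first innovation, prompting an LLM with example text-gloss pairs, can be illustrated with a minimal prompt-construction sketch. The example pairs, instruction wording, and function name are invented for illustration, and the actual LLM call is omitted.

```python
# Hypothetical few-shot examples of (spoken text, sign gloss) pairs.
EXAMPLES = [
    ("the weather will be nice tomorrow", "TOMORROW WEATHER NICE"),
    ("it will rain in the north", "NORTH RAIN"),
]

def build_prompt(sentence: str) -> str:
    """Assemble an in-context learning prompt: instruction, demonstration
    pairs, then the query sentence left open for the LLM to complete."""
    lines = ["Convert spoken-language text into sign glosses."]
    for text, gloss in EXAMPLES:
        lines.append(f"Text: {text}\nGloss: {gloss}")
    lines.append(f"Text: {sentence}\nGloss:")
    return "\n\n".join(lines)

prompt = build_prompt("it will be windy tomorrow")
print(prompt)
```

The LLM's completion of the final `Gloss:` line yields the draft pseudo gloss, which the weakly supervised reordering step then aligns with the video.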