GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning

📅 2025-06-09
🤖 AI Summary
Existing sign language generation (SLG) methods mostly condition on a sentence-level encoding of the input text. Because such a condition carries neither word-level semantics nor the temporal structure of signing, it leads to incorrect sign order, semantic inaccuracies, and ambiguous motions. To address this, the authors propose GLOS, an SLG framework with temporally aligned gloss-level conditioning. Gloss embeddings are arranged as a sequence aligned with the motion sequence, and a condition fusion module, Temporal Alignment Conditioning (TAC), delivers each gloss's semantics and timing to the corresponding motion timesteps. On CSL-Daily and Phoenix-2014T, the method generates signs with more accurate lexical order and higher semantic accuracy than prior methods.

📝 Abstract
Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress, existing SLG methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level conditioning, which encodes the entire input sentence into a single feature vector that serves as the condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This gives the model access to both the temporal structure of sign language and word-level semantics at each timestep, which allows fine-grained control of signs and better preservation of lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantics and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
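
The idea of a gloss-level condition, a sequence of gloss embeddings aligned with the motion sequence, can be illustrated with a minimal sketch. It assumes that the number of motion frames per gloss is already known and simply repeats each gloss embedding over its frames. The function name `align_gloss_conditions`, the toy glosses, the 2-D embeddings, and the durations are all hypothetical; the abstract does not say how the paper obtains the alignment or the embeddings, so this is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def align_gloss_conditions(
    glosses: Sequence[str],
    durations: Sequence[int],
    embedding_table: Dict[str, Vector],
) -> Tuple[List[Vector], List[str]]:
    """Repeat each gloss embedding over the frames assigned to that gloss.

    Returns one condition vector per motion frame, plus the gloss label of
    each frame so the alignment can be inspected.
    """
    if len(glosses) != len(durations):
        raise ValueError("need exactly one duration per gloss")

    frame_conditions: List[Vector] = []
    frame_labels: List[str] = []
    for gloss, n_frames in zip(glosses, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        embedding = embedding_table[gloss]
        for _ in range(n_frames):
            frame_conditions.append(list(embedding))  # copy, one per frame
            frame_labels.append(gloss)
    return frame_conditions, frame_labels


# Hypothetical toy data: 2-D embeddings and made-up frame counts.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "TOMORROW": [1.0, 0.0],
    "WEATHER": [0.0, 1.0],
    "GOOD": [1.0, 1.0],
}

conditions, labels = align_gloss_conditions(
    ["TOMORROW", "WEATHER", "GOOD"], [2, 3, 1], TOY_EMBEDDINGS
)
print(labels)
# ['TOMORROW', 'TOMORROW', 'WEATHER', 'WEATHER', 'WEATHER', 'GOOD']
```

In this sketch every motion frame has its own condition vector, so a generator reading `conditions[t]` at timestep `t` sees which gloss is being signed at that moment, which is the property the abstract attributes to gloss-level conditions.
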
Problem

Research questions and friction points this paper is trying to address.

Incorrect lexical ordering in sign language generation
Low semantic accuracy in text-to-sign methods
Lack of temporal structure in sentence-level conditioning
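
A toy example may help show why a single sentence-level vector can lose lexical order. The sketch below uses mean pooling as a deliberately simplified stand-in for a sentence encoder; this is an assumption for illustration only, since real sentence encoders are usually order-aware and the abstract does not name the encoder used by prior methods. The glosses and 2-D embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy embeddings, not taken from the paper.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "TOMORROW": [1.0, 0.0],
    "WEATHER": [0.0, 1.0],
    "GOOD": [1.0, 1.0],
}


def embed(glosses: Sequence[str]) -> List[Vector]:
    """Look up one embedding per gloss, preserving order."""
    return [TOY_EMBEDDINGS[g] for g in glosses]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Collapse a sequence of vectors into a single averaged vector."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


correct_order = ["TOMORROW", "WEATHER", "GOOD"]
wrong_order = ["GOOD", "TOMORROW", "WEATHER"]

pooled_correct = mean_pool(embed(correct_order))
pooled_wrong = mean_pool(embed(wrong_order))

# The pooled vectors coincide, so the order is unrecoverable from them.
same_pooled = all(
    math.isclose(a, b) for a, b in zip(pooled_correct, pooled_wrong)
)
# The per-gloss sequences still differ, so the order is preserved there.
same_sequence = embed(correct_order) == embed(wrong_order)

print(same_pooled, same_sequence)  # True False
```

Under this simplification the two orderings become indistinguishable once pooled, whereas the per-gloss sequence keeps them apart. An order-aware sentence encoder would not collapse them exactly, but it still compresses the whole sentence into one vector with no per-timestep structure.
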
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gloss-level conditions for temporal alignment
TAC module for efficient condition fusion
Fine-grained control of sign sequences