🤖 AI Summary
Sign language translation faces a large semantic gap between the visual and linguistic modalities, compounded by the difficulty of modeling subtle handshape and motion variations. To address these issues, this paper proposes a gloss-free, end-to-end framework built on a video large language model. The method introduces (1) a fine-grained hand-motion text generator that explicitly models spatio-temporal dynamics, and (2) a contrastive alignment module, combined with HaMeR feature distillation, that strengthens hand-centric cross-modal representation learning and mitigates the modality discrepancy. An additional contrastive loss between sign video representations and target-language embeddings during pre-training further improves alignment fidelity. Evaluated on the Phoenix14T and CSL-Daily benchmarks, the approach achieves state-of-the-art translation accuracy without relying on gloss annotations, validating the effectiveness and generalizability of the framework for gloss-free sign language understanding.
📝 Abstract
Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce **BeyondGloss**, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR, a hand mesh recovery model. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. **BeyondGloss** achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
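The contrastive alignment between paired video and text representations described in the abstract is typically implemented as a symmetric InfoNCE-style objective. The sketch below is a minimal NumPy illustration of that general technique, not the authors' actual implementation; the embedding shapes, the temperature value, and the function name are assumptions.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Returns a scalar loss that is small when matched pairs are most similar.
    (Illustrative sketch; hyperparameters are assumed, not from the paper.)
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (batch, batch) similarities
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def cross_entropy(lg, lb):
        # numerically stable log-softmax over each row, pick the target column
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # symmetric: video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

In a framework like the one described, one such loss would align hand-motion descriptions with video features, and another would align sign video representations with target-language embeddings during pre-training.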