🤖 AI Summary
Sign language translation faces a large semantic gap between the visual and linguistic modalities, compounded by the difficulty of modeling subtle handshape and motion variations. To address these issues, this paper proposes a gloss-free, end-to-end framework built on a video large language model. The method introduces (1) a fine-grained hand-motion text generator that explicitly models spatio-temporal dynamics, and (2) a contrastive alignment module, combined with HaMeR feature distillation, that strengthens hand-centric cross-modal representation learning and mitigates the modality discrepancy. An additional contrastive loss between sign video representations and target-language embeddings during pre-training further improves alignment fidelity. Evaluated on the Phoenix14T and CSL-Daily benchmarks, the approach achieves state-of-the-art translation accuracy without relying on gloss annotations, validating the effectiveness and generalizability of the framework for gloss-free sign language understanding.
📝 Abstract
Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce **BeyondGloss**, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR, a hand mesh recovery model. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. **BeyondGloss** achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
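The contrastive alignment between paired video and text representations described in the abstract is typically implemented as a symmetric InfoNCE-style objective. The sketch below is a minimal NumPy illustration of that general technique, not the authors' actual implementation; the embedding shapes, the temperature value, and the function name are assumptions.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Returns a scalar loss that is small when matched pairs are most similar.
    (Illustrative sketch; hyperparameters are assumed, not from the paper.)
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (batch, batch) similarities
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def cross_entropy(lg, lb):
        # numerically stable log-softmax over each row, pick the target column
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # symmetric: video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

In a framework like the one described, one such loss would align hand-motion descriptions with video features, and another would align sign video representations with target-language embeddings during pre-training.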