SignX: The Foundation Model for Sign Recognition

📅 2025-04-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Sign language recognition (SLR) faces two key challenges: the absence of a unified pose representation for American Sign Language (ASL) videos and inconsistent gloss annotations across datasets. To address these, the authors propose SignX, a foundation model framework for SLR built on two components: (1) a Pose2Gloss module, based on an inverse diffusion model, whose multi-track pose fusion layer unifies five state-of-the-art pose sources (including SMPLer-X and DWPose) into a single latent pose representation from which standardized gloss tokens are generated; and (2) a ViT-based Video2Pose module that converts raw video directly into that signer pose representation. Trained as a two-stage framework, SignX stays compatible with existing pose formats and recognizes signs from ASL video with greater gloss-prediction accuracy than reported in prior work.

πŸ“ Abstract
The complexity of sign language data processing brings many challenges. The current approach to recognition of ASL signs aims to translate RGB sign language videos, through pose information, into English-based ID glosses, which serve to uniquely identify ASL signs. Note that there is no shared convention for assigning such glosses to ASL signs, so it is essential that the same glossing conventions are used for all of the data in the datasets that are employed. This paper proposes SignX, a foundation model framework for sign recognition. It is a concise yet powerful framework applicable to multiple human activity recognition scenarios. First, we developed a Pose2Gloss component based on an inverse diffusion model, which contains a multi-track pose fusion layer that unifies five of the most powerful pose information sources (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a single latent pose representation. Second, we trained a Video2Pose module based on ViT that can directly convert raw video into signer pose representation. Through this two-stage training framework, we enable sign language recognition models to be compatible with existing pose formats, laying the foundation for the common pose estimation necessary for sign recognition. Experimental results show that SignX can recognize signs from sign language video, producing predicted gloss representations with greater accuracy than has been reported in prior work.
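The multi-track pose fusion described above can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the per-source feature sizes, the shared latent size, and the mean-pooling fusion rule are all assumptions; the real layer is learned end-to-end inside the diffusion-based Pose2Gloss component.

```python
import numpy as np

# Hypothetical sketch of a multi-track pose fusion layer: each of the five
# pose sources yields per-frame features of a different size; each track is
# projected into a shared latent space, and the tracks are averaged into a
# single latent pose representation. All dimensions are illustrative.
rng = np.random.default_rng(0)

SOURCE_DIMS = {            # assumed per-source feature sizes
    "smplerx": 330,
    "dwpose": 266,
    "mediapipe": 543,
    "primedepth": 128,
    "sapiens_seg": 64,
}
LATENT_DIM = 256           # assumed shared latent size

# one linear projection per track (random weights stand in for training)
projections = {
    name: rng.standard_normal((dim, LATENT_DIM)) / np.sqrt(dim)
    for name, dim in SOURCE_DIMS.items()
}

def fuse_tracks(frames_by_source):
    """Project each track into the latent space, then mean-pool across tracks."""
    latents = [feats @ projections[name] for name, feats in frames_by_source.items()]
    return np.mean(latents, axis=0)      # shape: (num_frames, LATENT_DIM)

# 16 video frames' worth of features from each pose estimator
features = {name: rng.standard_normal((16, dim)) for name, dim in SOURCE_DIMS.items()}
latent_pose = fuse_tracks(features)
print(latent_pose.shape)  # (16, 256)
```

Mean-pooling is just one plausible fusion rule; concatenation followed by a projection, or attention over tracks, would fit the same interface.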
Problem

Research questions and friction points this paper is trying to address.

Recognizing ASL signs from RGB videos accurately
Unifying diverse pose sources into a single representation
Improving gloss prediction accuracy in sign language recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose2Gloss component based on inverse diffusion model
Video2Pose module using ViT for direct conversion
Multi-track pose fusion unifying five pose sources
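The two-stage design listed above composes as a simple pipeline: Video2Pose maps raw frames to the latent pose representation, and Pose2Gloss maps that representation to gloss tokens. The sketch below stubs both stages with random projections purely to show the data flow; the real system uses a ViT encoder and an inverse diffusion model, and all sizes (latent dimension, gloss vocabulary, per-frame gloss output) are assumptions.

```python
import numpy as np

# Illustrative data flow of the two-stage SignX pipeline; both stages are
# stand-ins, not the actual trained models.
rng = np.random.default_rng(0)
LATENT_DIM, VOCAB_SIZE = 256, 1000   # assumed sizes

def video2pose(frames):
    """Stub for the ViT stage: frames (T, H, W, 3) -> latent poses (T, LATENT_DIM)."""
    flat = frames.reshape(len(frames), -1)
    w = rng.standard_normal((flat.shape[1], LATENT_DIM)) / np.sqrt(flat.shape[1])
    return flat @ w

def pose2gloss(latent_poses):
    """Stub for the diffusion stage: latent poses -> one gloss ID per frame."""
    w = rng.standard_normal((LATENT_DIM, VOCAB_SIZE)) / np.sqrt(LATENT_DIM)
    logits = latent_poses @ w
    return logits.argmax(axis=1)

frames = rng.standard_normal((8, 32, 32, 3))   # 8 toy video frames
gloss_ids = pose2gloss(video2pose(frames))
print(gloss_ids.shape)  # (8,)
```

Because the stages only share the latent pose interface, Video2Pose can be swapped or retrained independently of Pose2Gloss, which is what makes compatibility with existing pose formats possible.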