🤖 AI Summary
Continuous Sign Language Recognition (CSLR) suffers from scarce annotated data and coarse-grained annotations, exacerbating the modality gap between visual and linguistic representations. To address this, we propose a hierarchical alignment framework that integrates textual knowledge with visual representations. First, we explicitly model the temporal hierarchy of sign language via a sub-action tree structure. Second, we leverage large language models to extract fine-grained lexical semantics and employ cross-modal contrastive learning for progressive visual–textual alignment. Third, we introduce a multi-level contrastive loss to narrow the representational discrepancy across modalities. Evaluated on four benchmarks (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture), our method achieves substantial improvements in recognition accuracy, reducing word error rate (WER) by an average of 6.2%. It is the first end-to-end CSLR approach to incorporate lexical-level semantic guidance for alignment, establishing a scalable paradigm for low-resource sign language understanding.
📝 Abstract
Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the scarcity of large datasets and precise annotations has become a bottleneck for CSLR. To address this, some works have developed cross-modal solutions to align the visual and textual modalities. However, they typically extract textual features from glosses without fully exploiting the knowledge those glosses carry. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning the visual and textual modalities step by step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.
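To make the cross-modal alignment idea concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between paired visual and textual embeddings, summed over several hierarchy levels (e.g. sub-action, gloss, sentence). This is a minimal illustration of multi-level contrastive alignment in general, not the paper's actual HST-CSLR loss; the function names, the level weighting, and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE loss between paired embeddings.

    visual, textual: (batch, dim) tensors; row i of each tensor is
    assumed to describe the same sign segment / gloss (a positive pair).
    """
    visual = F.normalize(visual, dim=-1)
    textual = F.normalize(textual, dim=-1)
    logits = visual @ textual.t() / temperature      # (batch, batch) cosine similarities
    targets = torch.arange(visual.size(0))           # matched pairs lie on the diagonal
    # Pull matched pairs together and push mismatched ones apart, in both
    # the visual-to-text and text-to-visual directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_level_loss(visual_levels, textual_levels, weights=None):
    """Sum the pairwise contrastive loss over hierarchy levels,
    optionally weighted per level (hypothetical weighting scheme)."""
    weights = weights or [1.0] * len(visual_levels)
    return sum(w * contrastive_alignment_loss(v, t)
               for w, v, t in zip(weights, visual_levels, textual_levels))
```

In a hierarchical setup like the one the abstract describes, each level would supply its own pooled visual features and text embeddings, so coarse levels align whole glosses while finer levels align sub-actions.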