Hierarchical Sub-action Tree for Continuous Sign Language Recognition

📅 2025-06-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Continuous Sign Language Recognition (CSLR) suffers from scarce annotated data and coarse-grained annotations, exacerbating the modality gap between visual and linguistic representations. To address this, we propose a hierarchical alignment framework that integrates textual knowledge with visual representations. First, we explicitly model the temporal hierarchy of sign language via a sub-action tree structure. Second, we leverage large language models to extract fine-grained lexical semantics and employ cross-modal contrastive learning for progressive visual–textual alignment. Third, we introduce a multi-level contrastive loss to narrow the representational discrepancy across modalities. Evaluated on four benchmarks—PHOENIX-2014/T, CSL-Daily, and two Sign Language Gesture datasets—our method achieves substantial improvements in recognition accuracy, reducing word error rate (WER) by an average of 6.2%. It is the first end-to-end CSLR approach to incorporate lexical-level semantic guidance for alignment, establishing a scalable paradigm for low-resource sign language understanding.

Technology Category

Application Category

📝 Abstract
Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.
Problem

Research questions and friction points this paper is trying to address.

Lack of large datasets and precise annotations for CSLR
Ineffective utilization of gloss knowledge in cross-modal solutions
High computational complexity in aligning visual and textual modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Sub-action Tree for alignment
Leverages gloss knowledge from language models
Contrastive alignment enhances modality bridging
🔎 Similar Papers
No similar papers found.
Dejie Yang
Dejie Yang
Peking University
VLMRobot
Zhu Xu
Zhu Xu
Peking University
X
Xinjie Gao
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Y
Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China; State Key Laboratory of General Artificial Intelligence, Peking Universitys, Beijing, China