🤖 AI Summary
Continuous Sign Language Recognition (CSLR) faces two major challenges: high inter-signer variability and poor generalization to unseen sentence structures. To address these, we propose a dual-architecture framework: (1) a signer-invariant Conformer that integrates convolutional and multi-head self-attention mechanisms to learn robust, cross-signer representations; and (2) a skeletal keypoint-based multi-scale fusion Transformer coupled with a dual-path temporal encoder, jointly modeling fine-grained pose dynamics and syntactic composition to enhance structural understanding. Evaluated on the Isharah-1000 dataset, our method achieves a word error rate (WER) of 13.07% on the signer-independent (SI) task—surpassing the state-of-the-art by 13.53%—and 47.78% on the unseen-sentence (US) task. In the SignEval 2025 Challenge, it ranks fourth in the SI track and second in the US track.
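The WER figures above are the standard CSLR metric: the word-level Levenshtein distance between the predicted and reference gloss sequences, divided by the reference length. A minimal sketch of the computation (the function name `wer` and the toy sentences are illustrative, not from the paper):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j] (dynamic programming)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a 5-word reference: WER = 2/5.
print(wer("the cat sat on mat", "the dog sat on"))  # 0.4
```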
📝 Abstract
Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions often fail to address these issues effectively. To overcome these limitations, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we design a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures both fine-grained posture dynamics and syntactic composition, enabling the model to comprehend novel grammatical constructions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed Conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the Transformer model achieves a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the effectiveness of these models. The findings validate our key hypothesis: designing task-specific networks for the particular challenges of CSLR yields considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
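The core idea of a Conformer block is to pair self-attention (global temporal context) with convolution (local temporal context), each wrapped in a residual connection. The sketch below illustrates that structure on plain Python lists; it is not the authors' implementation, and it simplifies aggressively: single-head attention with identity query/key/value projections, a fixed 3-tap depthwise kernel, and no feed-forward modules or layer normalization.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Scaled dot-product self-attention over a list of feature vectors.
    Identity Q/K/V projections for brevity; a real Conformer learns
    separate per-head projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in seq])
        out.append([sum(w * v[i] for w, v in zip(scores, seq))
                    for i in range(d)])
    return out

def depthwise_conv(seq, kernel=(0.25, 0.5, 0.25)):
    """Per-channel 1D convolution over time (zero-padded): the local-context
    counterpart to attention's global context."""
    d = len(seq[0])
    pad = [[0.0] * d] + seq + [[0.0] * d]
    return [[sum(kernel[t] * pad[i + t][c] for t in range(3))
             for c in range(d)]
            for i in range(len(seq))]

def conformer_block(seq):
    """Simplified Conformer block: attention then convolution, each
    with a residual connection."""
    att = self_attention(seq)
    seq = [[a + b for a, b in zip(x, y)] for x, y in zip(seq, att)]
    conv = depthwise_conv(seq)
    return [[a + b for a, b in zip(x, y)] for x, y in zip(seq, conv)]

# Toy input: 4 frames of 3-dimensional keypoint features.
frames = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3],
          [0.2, 0.0, 0.1], [0.3, 0.1, 0.0]]
out = conformer_block(frames)
print(len(out), len(out[0]))  # 4 3 (sequence length and feature dim preserved)
```

In practice these blocks are stacked over per-frame skeletal keypoint features and trained with a CTC-style loss so that frame-level outputs align to the gloss sequence without frame-level labels.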