A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Continuous Sign Language Recognition (CSLR) faces two major challenges: high inter-signer variability and poor generalization to unseen sentence structures. To address these, we propose a dual-architecture framework: (1) a signer-invariant Conformer that integrates convolutional and multi-head self-attention mechanisms to learn robust, cross-signer representations; and (2) a skeletal keypoint-based multi-scale fusion Transformer with a dual-path temporal encoder that jointly models fine-grained pose dynamics and syntactic composition to enhance structural understanding. Evaluated on the Isharah-1000 dataset, the method achieves a word error rate (WER) of 13.07% on the signer-independent (SI) task, a 13.53% reduction from the state-of-the-art, and 47.78% on the unseen-sentence (US) task. In the SignEval 2025 Challenge, it placed 2nd in the US track and 4th in the SI track.

📝 Abstract
Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed Conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the Transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the effectiveness of these models. The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
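All results above are reported as Word Error Rate (WER), the standard CSLR metric: the word-level edit distance between the predicted and reference gloss sequences, divided by the reference length. A minimal, dependency-free sketch (not taken from the paper's evaluation code, which may differ):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))          # 0.0
print(round(wer("the cat sat", "the mat"), 2))    # 0.67 (1 sub + 1 del over 3 words)
```

A WER of 13.07% thus means roughly 13 word-level errors per 100 reference gloss words; lower is better, and values above 100% are possible when the hypothesis contains many insertions.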
Problem

Research questions and friction points this paper is trying to address.

Addressing inter-signer variability in continuous sign language recognition
Improving generalization to unseen sentence structures in CSLR
Developing task-specific networks for CSLR performance improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Signer-Invariant Conformer with convolutions and self-attention
Multi-Scale Fusion Transformer with dual-path encoder
Task-specific networks for sign language recognition challenges
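The core Conformer idea named above pairs self-attention (global temporal context across the whole sign sequence) with convolution (local frame-to-frame dynamics) in one residual block. The toy NumPy sketch below illustrates only that pairing; it is not the authors' architecture (it omits multi-head attention, layer norm, and the feed-forward modules, and uses a fixed averaging kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over time: x is (T, d).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)         # (T, T) frame-to-frame similarity
    return softmax(scores, axis=-1) @ x   # (T, d) globally mixed features

def depthwise_conv(x, k=3):
    # Per-channel 1D convolution with 'same' padding: local temporal patterns.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    w = np.ones((k, x.shape[-1])) / k     # toy averaging kernel per channel
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * w).sum(axis=0)
    return out

def conformer_style_block(x):
    # Residual attention (global context) then residual convolution (local detail).
    x = x + self_attention(x)
    x = x + depthwise_conv(x)
    return x

frames = np.random.randn(16, 8)           # 16 time steps of 8-dim pose features
out = conformer_style_block(frames)
print(out.shape)                          # (16, 8): sequence shape is preserved
```

The shape-preserving residual design is what lets such blocks be stacked deeply: attention spreads signer-independent context across the sequence while the convolution keeps fine-grained motion cues localized.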
Authors

Md Rezwanul Haque
Department of Electrical and Computer Engineering, University of Waterloo

Md. Milon Islam
University of Waterloo
Multimodal Machine Learning, AI for Health, Large Language Models

S M Taslim Uddin Raju
MASc in Computer Science (Specialized in AI)
Machine Learning, Medical Imaging, Deep Learning, Biomedical Engineering

Fakhri Karray
Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence