🤖 AI Summary
To address the robustness deficiency in isolated sign language recognition (ISLR) caused by low-quality data and large intra-class variation in signing speed, this paper proposes an end-to-end transferable training framework. Methodologically: (1) it introduces an IoU-balanced classification loss jointly optimized with an auxiliary temporal regression head to explicitly model gesture onset/offset and structural dynamics; (2) it designs a sign-language-specific image-video joint augmentation strategy; and (3) it incorporates a lightweight temporal modeling module. The framework achieves state-of-the-art performance on WLASL and Slovo benchmarks. Moreover, its training strategy demonstrates strong generalization across datasets and model architectures. The core contributions lie in task-driven loss design—integrating structural temporal constraints into classification—and a multimodal augmentation mechanism, collectively enhancing ISLR’s adaptability to real-world variations in data quality, signing speed, and articulation style.
📝 Abstract
Accurate recognition and interpretation of sign language are crucial for enhancing communication accessibility for deaf and hard-of-hearing individuals. However, current approaches to Isolated Sign Language Recognition (ISLR) often face challenges such as low data quality and variability in gesturing speed. This paper introduces a comprehensive model training pipeline for ISLR designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The pipeline incorporates carefully selected image and video augmentations to tackle low data quality and varying sign speeds. An additional regression head combined with an IoU-balanced classification loss enhances the model's awareness of the gesture and simplifies the capture of temporal information. Extensive experiments demonstrate that the developed training pipeline adapts easily to different datasets and architectures. An ablation study further shows that each proposed component addresses specific aspects of the ISLR task. The presented strategies improve recognition performance across various ISLR benchmarks and achieve state-of-the-art results on the WLASL and Slovo datasets.
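The abstract pairs an auxiliary temporal regression head with an IoU-balanced classification loss. As a rough illustration of how such a pairing can work, the sketch below scales each sample's cross-entropy by the temporal IoU between the interval predicted by the regression head and the annotated gesture onset/offset, so poorly localized samples contribute a down-weighted classification loss. The function names, the exponent `eta`, and the exact weighting scheme are illustrative assumptions, not the paper's implementation.

```python
import math

def temporal_iou(pred, gt):
    """IoU of two 1-D temporal intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def iou_balanced_ce(probs, label, pred_interval, gt_interval, eta=0.5):
    """Cross-entropy on class probabilities, scaled by IoU**eta.

    `probs` is a list of class probabilities for one sample; `pred_interval`
    comes from the regression head, `gt_interval` from annotation.
    `eta` (hypothetical) controls how strongly localization quality
    modulates the classification loss.
    """
    weight = temporal_iou(pred_interval, gt_interval) ** eta
    return -weight * math.log(probs[label])
```

In this form the regression head receives a separate localization loss (e.g. L1 on onset/offset), while the IoU weighting couples the two objectives during joint training.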