🤖 AI Summary
This study investigates the role of non-manual facial features (specifically the eyes, mouth, and full face) in vision-based isolated-word automatic sign language recognition (ASLR). The contribution of each facial region to recognition performance is quantified using both CNN and Transformer architectures on an isolated-sign benchmark, complemented by qualitative analysis via saliency maps. Results demonstrate that incorporating facial features significantly improves accuracy, with the mouth region yielding the largest gain and substantially outperforming both eyes-only and full-face inputs. This work advances beyond prior coarse-grained comparisons (e.g., "hand-only" vs. "hand+full-face") by introducing the first fine-grained attribution analysis of facial subregions in ASLR. It establishes that precise modeling of mouth dynamics is essential for robust ASLR and provides empirical grounding for designing multimodal representations tailored to sign language understanding.
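To make the saliency-based analysis mentioned above concrete, the sketch below computes a vanilla input-gradient saliency map for a toy frame classifier in PyTorch. The `TinySignClassifier` model, its layer sizes, and the choice of plain gradient saliency are assumptions for illustration only; the study's actual architectures and attribution method may differ.

```python
# A minimal sketch of gradient-based saliency for an isolated-sign classifier.
# TinySignClassifier is a hypothetical stand-in, not the paper's model.
import torch
import torch.nn as nn

class TinySignClassifier(nn.Module):
    """Toy CNN standing in for the study's recognizer (assumption)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.head = nn.Linear(16 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

def saliency_map(model: nn.Module, frame: torch.Tensor) -> torch.Tensor:
    """Vanilla gradient saliency: |d score / d pixel|, max over channels."""
    model.eval()
    frame = frame.clone().requires_grad_(True)
    score = model(frame.unsqueeze(0)).max()  # score of the predicted class
    score.backward()
    return frame.grad.abs().amax(dim=0)      # (H, W) importance map

if __name__ == "__main__":
    model = TinySignClassifier()
    frame = torch.rand(3, 64, 64)            # one RGB video frame
    print(saliency_map(model, frame).shape)  # torch.Size([64, 64])
```

High values in the resulting map mark pixels whose perturbation most affects the predicted score; aggregated over frames, such maps are one way to visualize whether the model attends to the mouth, the eyes, or elsewhere.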
📝 Abstract
Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and rarely goes beyond comparing manual features alone against the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a Transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance evaluation and qualitative saliency-map analysis, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.
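For intuition on the region-wise setup the abstract describes, here is a minimal sketch of how eye, mouth, and full-face crops could be extracted from each frame before being fed to the two recognizers. The `REGIONS` bounding boxes, crop size, and helper name are hypothetical placeholders, not the paper's actual preprocessing pipeline.

```python
# A minimal sketch of the region-ablation input preparation: crop the eyes,
# mouth, or full face from a frame and resize it to a fixed model input size.
# All coordinates below are made-up placeholders (assumption), in practice
# they would come from a facial-landmark detector.
import torch
import torch.nn.functional as F

# Hypothetical landmark-derived boxes: (x0, y0, x1, y1) in pixel coordinates.
REGIONS = {
    "eyes":      (16, 12, 48, 28),
    "mouth":     (20, 36, 44, 56),
    "full_face": (8,   4, 56, 60),
}

def crop_region(frame: torch.Tensor, region: str, size: int = 32) -> torch.Tensor:
    """Crop one facial region from a (C, H, W) frame and resize it."""
    x0, y0, x1, y1 = REGIONS[region]
    patch = frame[:, y0:y1, x0:x1].unsqueeze(0)   # add batch dim for resize
    return F.interpolate(patch, size=(size, size), mode="bilinear",
                         align_corners=False).squeeze(0)

if __name__ == "__main__":
    frame = torch.rand(3, 64, 64)                 # one RGB video frame
    for name in REGIONS:
        print(name, crop_region(frame, name).shape)  # all (3, 32, 32)
```

Training one model per input variant (eyes, mouth, full face) and comparing accuracies is the ablation logic that isolates each region's contribution.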