MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in continuous American Sign Language (ASL) video recognition: inaccurate gesture boundary detection and limited robustness in sign recognition. To this end, we propose a handshape-aware multimodal boundary detection and recognition framework: (1) we jointly model 3D skeletal motion dynamics and fine-grained handshape features, incorporating linguistically grounded priors over 87 canonical handshapes; (2) we design a multimodal fusion module that integrates a pretrained handshape classifier with temporal skeletal modeling; and (3) we construct a large-scale, manually annotated continuous sign segmentation dataset, enabling joint training for both isolated and continuous signing scenarios. Evaluated on the ASLLRP benchmark, our approach achieves significant improvements in boundary detection accuracy and end-to-end recognition performance. The framework also offers improved interpretability, through explicit handshape modeling, and greater robustness for continuous sign language recognition.

📝 Abstract
This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.
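The abstract describes combining a skeletal boundary-detection stream with a pretrained handshape classifier through a fusion module. A minimal sketch of one way such late fusion could look per frame; the weighting scheme, function names, and threshold are illustrative assumptions, not details from the paper:

```python
import numpy as np

def fuse_boundary_scores(skel_scores, hs_scores, w=0.6, thresh=0.5):
    """Illustrative late fusion of per-frame boundary scores.

    skel_scores: (T,) boundary probabilities from a skeletal temporal model
    hs_scores:   (T,) boundary cues derived from handshape classifier
                 confidence (canonical handshapes tend to appear at sign
                 starts and ends)
    w:           assumed weight on the skeletal stream
    Returns the indices of frames flagged as sign boundaries.
    """
    fused = w * np.asarray(skel_scores) + (1 - w) * np.asarray(hs_scores)
    return np.flatnonzero(fused >= thresh)

# Usage: frames 0 and 3 score highly in both streams, so only they
# survive the threshold after fusion.
boundaries = fuse_boundary_scores([0.9, 0.1, 0.2, 0.8],
                                  [0.8, 0.2, 0.1, 0.9])
```

The actual fusion module is learned; this fixed weighted average only shows the shape of the computation, with the estimated boundaries then handed to the recognition model.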
Problem

Research questions and friction points this paper is trying to address.

Detecting sign boundaries in continuous ASL videos using multimodal features
Incorporating 3D handshape information to improve boundary detection accuracy
Recognizing segmented signs from both isolated and continuous signing contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of skeletal features and handshape data
Pretrained handshape classifier with 87 canonical categories
Boundary detection using convergence of sign dynamics
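The last bullet refers to the observation, also made in the abstract, that sign properties and their dynamics tend to cluster at sign boundaries. A hedged sketch of one such dynamic cue, low inter-frame motion of a 3D keypoint track; the function names and threshold are illustrative, not the paper's actual features:

```python
import numpy as np

def motion_speed(keypoints):
    """Per-frame speed of a 3D keypoint track: (T, 3) -> (T-1,).

    Speed is the Euclidean norm of the displacement between
    consecutive frames.
    """
    return np.linalg.norm(np.diff(keypoints, axis=0), axis=1)

def low_motion_frames(keypoints, thresh=0.5):
    """Frames whose displacement falls below thresh; such low-motion
    troughs are one cue that a sign boundary may be nearby."""
    return np.flatnonzero(motion_speed(keypoints) < thresh)

# Usage: the track pauses between frames 1 and 2, so that transition
# is flagged as a candidate boundary region.
track = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 0], [2, 0, 0]], float)
candidates = low_motion_frames(track)
```

In the paper this kind of dynamic evidence is modeled jointly with handshape information rather than thresholded in isolation.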