🤖 AI Summary
Accurate and efficient word-level Turkish Sign Language (TSL) recognition remains challenging for real-time assistive communication systems. Method: This paper reformulates sign language recognition as a sequence-to-sequence translation task, using only the 3D skeletal coordinates of the hands and torso extracted by MediaPipe. We propose a lightweight, temporally aware Transformer architecture designed for skeletal sequence modeling under linguistic constraints and strict real-time requirements. Contribution/Results: Evaluated on the AUTSL dataset (36,000+ samples, 227 vocabulary items), our method achieves competitive accuracy while significantly reducing model parameters. It attains sub-30 ms inference latency per frame, meeting stringent mobile deployment requirements. The approach delivers a high-accuracy, low-latency, and production-ready solution for assistive communication systems supporting the Deaf and hard-of-hearing community.
📝 Abstract
This study presents TSLFormer, a lightweight and robust word-level Turkish Sign Language (TSL) recognition model that treats sign gestures as ordered, string-like language. Instead of using raw RGB or depth videos, our method works only with 3D joint positions (articulation points) extracted using Google's MediaPipe library, focusing on the hand and torso skeletal locations. This substantially reduces input dimensionality while preserving important semantic gesture information. Our approach recasts sign language recognition as sequence-to-sequence translation, inspired by the linguistic nature of sign languages and the success of transformers in natural language processing. Because TSLFormer uses the self-attention mechanism, it effectively captures temporal co-occurrence within gesture sequences and highlights meaningful motion patterns as words unfold. Evaluated on the AUTSL dataset, with over 36,000 samples and 227 different words, TSLFormer achieves competitive performance with minimal computational cost. These results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.
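To make the idea of self-attention over joint sequences concrete, the following is a minimal, dependency-free sketch and not the TSLFormer implementation. It assumes each frame is a flat vector of (x, y, z) coordinates for 21 landmarks per hand (the count MediaPipe Hands returns) plus a hypothetical 6 torso points; the model width, the single attention head, the mean pooling, and the random weights are all illustrative choices, and positional encoding is omitted for brevity.

```python
import math
import random

# Assumed input layout (illustrative, not taken from the paper).
HAND_LANDMARKS = 21   # MediaPipe Hands returns 21 landmarks per hand
NUM_HANDS = 2
TORSO_POINTS = 6      # hypothetical number of torso joints
COORDS = 3            # x, y, z
FEATURES_PER_FRAME = (HAND_LANDMARKS * NUM_HANDS + TORSO_POINTS) * COORDS


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random stand-in for learned weights or for extracted keypoints."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the time axis.

    frames is a T x F matrix; returns the T x d output and the T x T weights.
    """
    q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def classify(frames, params):
    """Attend over frames, mean-pool over time, and score each word class."""
    attended, _ = self_attention(frames, params["w_q"], params["w_k"], params["w_v"])
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    logits = matmul([pooled], params["w_out"])[0]
    return softmax(logits)


rng = random.Random(0)
d_model, num_classes, num_frames = 16, 227, 12  # 227 words as in the abstract
params = {
    "w_q": random_matrix(FEATURES_PER_FRAME, d_model, rng),
    "w_k": random_matrix(FEATURES_PER_FRAME, d_model, rng),
    "w_v": random_matrix(FEATURES_PER_FRAME, d_model, rng),
    "w_out": random_matrix(d_model, num_classes, rng),
}
clip = random_matrix(num_frames, FEATURES_PER_FRAME, rng, scale=1.0)  # fake keypoints
probs = classify(clip, params)
predicted_word_id = max(range(num_classes), key=probs.__getitem__)
```

Under these assumptions each frame carries 144 values, compared with hundreds of thousands of pixel values in a raw RGB frame, which illustrates the dimensionality reduction described above. A practical model would add positional information, stack several multi-head layers, and use a tensor library rather than Python lists.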