🤖 AI Summary
The lack of high-quality benchmark datasets hinders progress in Italian Sign Language (LIS) recognition. Method: This work introduces SignIT, the first large-scale, fine-grained, multimodal LIS benchmark, comprising 644 videos (3.33 hours), 94 sign classes spanning five semantic categories, and synchronized 2D keypoint annotations for the hands, face, and body. We establish a standardized evaluation protocol and systematically benchmark temporal models (LSTMs, Transformers, and multi-stream CNNs) on RGB, skeletal, and multimodal inputs. Contribution/Results: Experiments reveal the limited performance of unimodal (RGB-only or keypoints-only) approaches, while RGB–skeleton fusion significantly improves accuracy. Nevertheless, state-of-the-art models still show substantial limitations on authentic LIS data. SignIT is publicly released with a standardized evaluation framework and reproducible baselines, enabling rigorous progress in sign language understanding research.
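To make the RGB–skeleton fusion idea concrete, below is a minimal PyTorch sketch of a two-stream classifier: a small per-frame CNN over RGB clips and an LSTM over 2D keypoint sequences, fused by concatenation before a 94-way sign classifier. The module structure, feature sizes, keypoint count (75), and input shapes are illustrative assumptions, not the paper's actual baselines.

```python
# Minimal late-fusion sketch (PyTorch). Module structure, feature sizes, and
# input shapes are illustrative assumptions, not the SignIT baselines.
import torch
import torch.nn as nn

NUM_CLASSES = 94  # sign classes in SignIT


class RGBStream(nn.Module):
    """Tiny per-frame CNN with temporal average pooling (a stand-in for a
    pretrained backbone such as a ResNet)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> per-frame features -> mean over time
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        return feats.mean(dim=1)


class KeypointStream(nn.Module):
    """LSTM over flattened 2D keypoints (hands + face + body)."""

    def __init__(self, num_kpts: int = 75, feat_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(num_kpts * 2, feat_dim, batch_first=True)

    def forward(self, kpts: torch.Tensor) -> torch.Tensor:
        # kpts: (B, T, K, 2) -> (B, T, K*2) -> last hidden state
        _, (h, _) = self.lstm(kpts.flatten(2))
        return h[-1]


class FusionClassifier(nn.Module):
    """Concatenates the two stream embeddings and classifies the sign."""

    def __init__(self):
        super().__init__()
        self.rgb, self.kpt = RGBStream(), KeypointStream()
        self.head = nn.Linear(128 * 2, NUM_CLASSES)

    def forward(self, frames, kpts):
        return self.head(torch.cat([self.rgb(frames), self.kpt(kpts)], dim=-1))


if __name__ == "__main__":
    model = FusionClassifier()
    frames = torch.randn(2, 16, 3, 112, 112)  # batch of 2 clips, 16 frames each
    kpts = torch.randn(2, 16, 75, 2)          # 75 keypoints per frame (assumed)
    print(model(frames, kpts).shape)          # torch.Size([2, 94])
```

A Transformer encoder could replace the LSTM stream without changing the fusion interface, which is one way the benchmarked temporal models can be swapped in.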
📝 Abstract
In this work we present SignIT, a new dataset for studying the task of Italian Sign Language (LIS) recognition. The dataset comprises 644 videos covering 3.33 hours. We manually annotated the videos according to a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints for the hands, face and body of the signers. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames influence their performance. Results highlight the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.
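As a usage illustration, the snippet below shows a common preprocessing step for skeletal input before it reaches a temporal model: centering each frame's 2D keypoints on a reference joint and scale-normalizing them. The (T, K, 2) array layout and the choice of reference joint are assumptions made for illustration; the released annotations define the actual format.

```python
# Illustrative keypoint preprocessing (NumPy). The (T, K, 2) layout and the
# reference joint are assumptions, not the dataset's documented format.
import numpy as np


def normalize_keypoints(kpts: np.ndarray, ref_joint: int = 0) -> np.ndarray:
    """Center each frame on a reference joint (e.g., a torso point) and
    rescale so the skeleton has roughly unit standard deviation.

    kpts: (T, K, 2) array of 2D coordinates for T frames and K joints.
    """
    centered = kpts - kpts[:, ref_joint:ref_joint + 1, :]  # (T, K, 2)
    scale = centered.std() + 1e-8                          # global scale factor
    return centered / scale


if __name__ == "__main__":
    dummy = np.random.rand(16, 75, 2) * 640   # 16 frames, 75 joints, in pixels
    norm = normalize_keypoints(dummy)
    print(norm.shape, round(float(norm.std()), 3))  # (16, 75, 2) ~1.0
```

Normalization of this kind makes skeletal input invariant to signer position and camera distance, which is typically important when comparing keypoint-only models against RGB baselines.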