Sign Language Recognition in the Age of LLMs

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic evaluation of the zero-shot generalization capabilities of general-purpose vision-language models (VLMs) on isolated sign language recognition (ISLR) in the era of large language models. Using the WLASL300 benchmark, the authors compare multiple open-source and closed-source VLMs through zero-shot inference, prompt engineering, and multimodal alignment analysis. The results show that open-source VLMs significantly underperform task-specific supervised models, while large closed-source models achieve substantially higher accuracy; alignment analysis nevertheless indicates that even the weaker models capture partial visual-semantic alignment between signs and their textual descriptions. The findings highlight the critical roles of model scale and training-data diversity in effective sign language interpretation and offer new insights into the applicability of general-purpose multimodal models to low-resource visual tasks.

📝 Abstract
Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems, such as isolated sign language recognition (ISLR), without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs fall behind classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.
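As a rough illustration of the prompt-only zero-shot setup the abstract describes, the sketch below scores a video embedding against text embeddings of candidate gloss prompts and picks the best match. The gloss list, prompt template, and all embedding values are toy stand-ins invented for this example, not the paper's actual model, prompts, or outputs.

```python
import math

# Toy sketch of zero-shot ISLR scoring. In a real pipeline, a VLM would
# embed the sign video and text prompts such as "a video of the sign for
# <gloss>"; here every embedding is a hand-made stand-in, so the numbers
# carry no real meaning.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# One hypothetical text embedding per candidate gloss prompt.
text_embs = {
    "book":     [0.9, 0.1, 0.0],
    "drink":    [0.1, 0.8, 0.2],
    "computer": [0.0, 0.2, 0.9],
}

video_emb = [0.85, 0.15, 0.05]  # stand-in for the clip's video embedding

# Zero-shot prediction: the gloss whose prompt embedding is most similar.
scores = {g: cosine(video_emb, e) for g, e in text_embs.items()}
pred = max(scores, key=scores.get)
print(pred)  # prints "book"
```

The same argmax-over-similarity step applies whatever encoder produces the embeddings; the paper's point is that the quality of those embeddings, driven by model scale and training data, is what separates open-source from proprietary VLMs on this task.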
Problem

Research questions and friction points this paper is trying to address.

Sign Language Recognition
Vision Language Models
Zero-shot Learning
Isolated Sign Language Recognition
Multimodal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Models
Zero-shot Learning
Isolated Sign Language Recognition
Visual-Semantic Alignment
Model Scaling
Vaclav Javorek
University of West Bohemia, Czech Republic; Eindhoven University of Technology, The Netherlands
Jakub Honzik
University of West Bohemia, Czech Republic
Ivan Gruber
University of West Bohemia, Czech Republic
Tomas Zelezny
University of West Bohemia, Czech Republic
Marek Hruz
University of West Bohemia
artificial intelligence · image processing · machine learning