🤖 AI Summary
Sign language recognition (SLR) is hard to annotate accurately because manual and non-manual signals occur simultaneously. The authors propose what is, to the best of their knowledge, the first SLR framework to integrate generative LLMs. It uses retrieval-augmented generation (RAG) over expert-validated sign language corpora, with multi-step prompting, to generate sign descriptions at three granularities: global, synonym, and part level. A dual-encoder architecture bidirectionally aligns hierarchical skeleton features with these text descriptions through probabilistic matching, trained with multi-positive contrastive learning that combines global and part-level losses and minimizes KL divergence. The method reports 97.1% accuracy on the Chinese SLR500 benchmark and 97.07% on the Turkish AUTSL benchmark, which the authors describe as state of the art. Comparable results on two different sign languages suggest cross-lingual effectiveness and potential for inclusive communication technologies.
📝 Abstract
Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (97.1% accuracy) and Turkish AUTSL (97.07% accuracy) datasets. The method's cross-lingual effectiveness highlights its potential for developing inclusive communication technologies.
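The abstract does not give the loss formula, so the following is only a minimal sketch of one common way to write a bidirectional multi-positive contrastive objective with KL divergence. It assumes a precomputed skeleton-to-text similarity matrix, a binary mask marking which texts describe each skeleton sample, a target distribution that is uniform over the positives, and a temperature `tau`; the function names and these choices are illustrative assumptions, not the GSP-MC implementation.

```python
import math


def softmax(row, tau):
    """Temperature-scaled softmax over one row of similarities."""
    scaled = [x / tau for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_positives(sim_row, pos_row, tau):
    """KL(q || p): q is uniform over the positives, p = softmax(sim / tau)."""
    n_pos = sum(pos_row)
    p = softmax(sim_row, tau)
    loss = 0.0
    for p_j, is_pos in zip(p, pos_row):
        if is_pos:
            q_j = 1.0 / n_pos
            loss += q_j * math.log(q_j / p_j)
    return loss


def directional_loss(sim, pos, tau):
    """Mean KL over rows that have at least one positive."""
    losses = [kl_to_positives(s, m, tau) for s, m in zip(sim, pos) if sum(m) > 0]
    return sum(losses) / len(losses) if losses else 0.0


def multi_positive_loss(sim, pos, tau=0.1):
    """Average of skeleton-to-text and text-to-skeleton losses.

    sim[i][j]: similarity between skeleton sample i and text description j.
    pos[i][j]: 1 if text j is a valid description for sample i, else 0.
    """
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text side
    pos_t = [list(col) for col in zip(*pos)]
    skeleton_to_text = directional_loss(sim, pos, tau)
    text_to_skeleton = directional_loss(sim_t, pos_t, tau)
    return 0.5 * (skeleton_to_text + text_to_skeleton)


# Two skeleton samples, four descriptions; each sample has two positives
# (for example, a global description and a synonym description).
pos = [[1, 1, 0, 0], [0, 0, 1, 1]]
aligned = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]]
swapped = [[0.1, 0.1, 0.9, 0.9], [0.9, 0.9, 0.1, 0.1]]
uniform = [[0.0] * 4, [0.0] * 4]

loss_aligned = multi_positive_loss(aligned, pos)
loss_swapped = multi_positive_loss(swapped, pos)
loss_uniform = multi_positive_loss(uniform, pos)
```

Because every positive description receives target mass, the loss pulls a skeleton embedding toward all of its valid descriptions at once instead of a single caption. In this sketch a part-level loss would be the same computation applied to part-level skeleton features and part-level descriptions, then added to the global term.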