🤖 AI Summary
Sign language emotion recognition faces two major challenges: ambiguity between grammatical and affective facial expressions, and severe scarcity of annotated data. To address these, this work proposes a cross-lingual modeling framework grounded in the eJSL Japanese Sign Language dataset and the BOBSL British Sign Language dataset. Methodologically, it is the first to empirically validate the transferability of text-based sentiment models to sign language; it introduces a key temporal segment selection strategy to mitigate motion redundancy; it fuses facial and manual movement features, showing that manual cues yield an 8.2% relative performance gain; and it incorporates subtitle-driven weakly supervised pretraining. Evaluated on eJSL, the approach achieves high accuracy across seven emotion classes and establishes a new state-of-the-art baseline for sign language emotion recognition, significantly outperforming mainstream spoken-language large language models.
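To make the temporal segment selection idea concrete, here is a minimal sketch that picks the highest-motion window from per-frame features; the window size and the motion-energy heuristic are illustrative assumptions, not the paper's actual selection strategy.

```python
# Hedged sketch of key temporal segment selection: keep the contiguous
# window of frames with the largest motion energy. Window size and the
# frame-difference heuristic are assumptions made for illustration.
import numpy as np

def select_key_segment(frame_feats: np.ndarray, window: int = 16) -> np.ndarray:
    """Return the `window`-frame slice of (T, D) features with the most motion."""
    T = frame_feats.shape[0]
    if T <= window:
        return frame_feats  # clip shorter than the window: keep everything
    # Per-frame motion energy: L2 norm of the temporal difference, shape (T-1,).
    motion = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)
    # Sliding-window sum over the window-1 transitions inside each window,
    # computed with a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(motion)))
    window_energy = csum[window - 1:] - csum[:-(window - 1)]
    start = int(np.argmax(window_energy))
    return frame_feats[start:start + window]
```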
📝 Abstract
Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken-language LLMs.
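For a concrete picture of how facial and manual cues might be combined for the seven-class task described above, below is a minimal sketch of a concatenation-based fusion classifier; the feature dimensions, layer sizes, and fusion scheme are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: late fusion of pooled facial and hand-motion features
# for 7-class emotion classification. All dimensions are illustrative.
import torch
import torch.nn as nn

class FaceHandFusionClassifier(nn.Module):
    def __init__(self, face_dim: int = 512, hand_dim: int = 256,
                 hidden: int = 256, num_emotions: int = 7):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, hidden)
        self.hand_proj = nn.Linear(hand_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_emotions),  # both streams concatenated
        )

    def forward(self, face_feat: torch.Tensor, hand_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.face_proj(face_feat), self.hand_proj(hand_feat)], dim=-1)
        return self.classifier(fused)  # logits over the seven emotion classes

# Example: one clip represented by pooled face and hand features.
logits = FaceHandFusionClassifier()(torch.randn(1, 512), torch.randn(1, 256))
```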